[v3,4/8] unpack-trees: fix nested sparse-dir search

Message ID	10bcadb284e49419f9b4baf75e05f719ec395d98.1629206603.git.gitgitgadget@gmail.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <10bcadb284e49419f9b4baf75e05f719ec395d98.1629206603.git.gitgitgadget@gmail.com> In-Reply-To: <pull.1009.v3.git.1629206602.gitgitgadget@gmail.com> References: <pull.1009.v2.git.1628625013.gitgitgadget@gmail.com> <pull.1009.v3.git.1629206602.gitgitgadget@gmail.com> Date: Tue, 17 Aug 2021 13:23:18 +0000 Subject: [PATCH v3 4/8] unpack-trees: fix nested sparse-dir search Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, newren@gmail.com, matheus.bernardino@usp.br, stolee@gmail.com, Derrick Stolee <derrickstolee@github.com>, Derrick Stolee <dstolee@microsoft.com> Precedence: bulk From: Derrick Stolee <dstolee@microsoft.com>
Series	Sparse index: delete ignored files outside sparse cone \| expand [v3,0/8] Sparse index: delete ignored files outside sparse cone [v3,1/8] t7519: rewrite sparse index test [v3,2/8] sparse-index: silently return when not using cone-mode patterns [v3,3/8] sparse-index: silently return when cache tree fails [v3,4/8] unpack-trees: fix nested sparse-dir search [v3,5/8] sparse-checkout: create helper methods [v3,6/8] attr: be careful about sparse directories [v3,7/8] sparse-index: add SPARSE_INDEX_IGNORE_CONFIG flag [v3,8/8] sparse-checkout: clear tracked sparse dirs

Derrick Stolee Aug. 17, 2021, 1:23 p.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

The iterated search in find_cache_entry() was recently modified to
include a loop that searches backwards for a sparse directory entry that
matches the given traverse_info and name_entry. However, the string
comparison failed to actually concatenate those two strings, so this
failed to find a sparse directory when it was not a top-level directory.

This caused some errors in rare cases where a 'git checkout' spanned a

Johannes Schindelin Aug. 19, 2021, 8:01 a.m. UTC | #1

Hi Stolee,

On Tue, 17 Aug 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The iterated search in find_cache_entry() was recently modified to
> include a loop that searches backwards for a sparse directory entry that
> matches the given traverse_info and name_entry. However, the string
> comparison failed to actually concatenate those two strings, so this
> failed to find a sparse directory when it was not a top-level directory.
>
> This caused some errors in rare cases where a 'git checkout' spanned a
> diff that modified files within the sparse directory entry, but we could
> not correctly find the entry.

Good explanation.

I wonder a bit about the performance impact. How "hot" is this function?
I.e. how often is it called, on average?

I ask because I see opportunities to optimize in both directions: it could
be written more concisely (if speed does not matter as much), and it could
be made faster (if speed matters a lot). See below for more.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  unpack-trees.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 5786645f315..df1f4437723 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1255,9 +1255,10 @@ static int sparse_dir_matches_path(const struct cache_entry *ce,
>  static struct cache_entry *find_cache_entry(struct traverse_info *info,
>  					    const struct name_entry *p)
>  {
> -	struct cache_entry *ce;
> +	struct cache_entry *ce = NULL;

Makes sense: since you need to release the allocated memory, you can no
longer `return NULL` early, but have to break out of the loop and return
`ce`.

>  	int pos = find_cache_pos(info, p->path, p->pathlen);
>  	struct unpack_trees_options *o = info->data;
> +	struct strbuf full_path = STRBUF_INIT;
>
>  	if (0 <= pos)
>  		return o->src_index->cache[pos];
> @@ -1273,6 +1274,10 @@ static struct cache_entry *find_cache_entry(struct traverse_info *info,
>  	if (pos < 0 || pos >= o->src_index->cache_nr)
>  		return NULL;
>
> +	strbuf_addstr(&full_path, info->traverse_path);
> +	strbuf_add(&full_path, p->path, p->pathlen);
> +	strbuf_addch(&full_path, '/');

This could be reduced to:

	strbuf_addf(&full_path, "%s%.*s/",
		    info->traverse_path, (int)p->pathlen, p->path);

But if speed matters, we probably need something more like this:

	size_t full_path_len;
	const char *full_path;
	char *full_path_1 = NULL;

	if (!*info->traverse_path) {
		full_path = p->path;
		full_path_len = p->pathlen;
	} else {
		size_t len = strlen(info->traverse_path);

		full_path_len = len + p->pathlen + 1;
		full_path = full_path_1 = xmalloc(full_path_len + 1);
		memcpy(full_path_1, info->traverse_path, len);
		memcpy(full_path_1 + len, p->path, p->pathlen);
		full_path_1[full_path_len - 1] = '/';
		full_path_1[full_path_len] = '\0';
	}

	[...]

	free(full_path_1);

It would obviously be much nicer if we did not have to go for that ugly
long version...

> +
>  	/*
>  	 * Due to lexicographic sorting and sparse directory
>  	 * entries ending with a trailing slash, our path as a
> @@ -1283,17 +1288,20 @@ static struct cache_entry *find_cache_entry(struct traverse_info *info,
>  	while (pos >= 0) {
>  		ce = o->src_index->cache[pos];
>
> -		if (strncmp(ce->name, p->path, p->pathlen))
> -			return NULL;
> +		if (strncmp(ce->name, full_path.buf, full_path.len)) {
> +			ce = NULL;
> +			break;
> +		}
>
>  		if (S_ISSPARSEDIR(ce->ce_mode) &&
>  		    sparse_dir_matches_path(ce, info, p))
> -			return ce;
> +			break;
>
>  		pos--;
>  	}
>
> -	return NULL;
> +	strbuf_release(&full_path);
> +	return ce;
>  }
>
>  static void debug_path(struct traverse_info *info)
> --
> gitgitgadget
>
>

Elijah Newren Aug. 19, 2021, 6:29 p.m. UTC | #2

On Tue, Aug 17, 2021 at 6:23 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The iterated search in find_cache_entry() was recently modified to
> include a loop that searches backwards for a sparse directory entry that
> matches the given traverse_info and name_entry. However, the string
> comparison failed to actually concatenate those two strings, so this
> failed to find a sparse directory when it was not a top-level directory.
>
> This caused some errors in rare cases where a 'git checkout' spanned a
> diff that modified files within the sparse directory entry, but we could
> not correctly find the entry.

Doh, sorry for missing that in my review of the earlier series.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  unpack-trees.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 5786645f315..df1f4437723 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1255,9 +1255,10 @@ static int sparse_dir_matches_path(const struct cache_entry *ce,
>  static struct cache_entry *find_cache_entry(struct traverse_info *info,
>                                             const struct name_entry *p)
>  {
> -       struct cache_entry *ce;
> +       struct cache_entry *ce = NULL;
>         int pos = find_cache_pos(info, p->path, p->pathlen);
>         struct unpack_trees_options *o = info->data;
> +       struct strbuf full_path = STRBUF_INIT;
>
>         if (0 <= pos)
>                 return o->src_index->cache[pos];
> @@ -1273,6 +1274,10 @@ static struct cache_entry *find_cache_entry(struct traverse_info *info,
>         if (pos < 0 || pos >= o->src_index->cache_nr)
>                 return NULL;
>
> +       strbuf_addstr(&full_path, info->traverse_path);
> +       strbuf_add(&full_path, p->path, p->pathlen);
> +       strbuf_addch(&full_path, '/');
> +
>         /*
>          * Due to lexicographic sorting and sparse directory
>          * entries ending with a trailing slash, our path as a
> @@ -1283,17 +1288,20 @@ static struct cache_entry *find_cache_entry(struct traverse_info *info,
>         while (pos >= 0) {
>                 ce = o->src_index->cache[pos];
>
> -               if (strncmp(ce->name, p->path, p->pathlen))
> -                       return NULL;
> +               if (strncmp(ce->name, full_path.buf, full_path.len)) {
> +                       ce = NULL;
> +                       break;
> +               }
>
>                 if (S_ISSPARSEDIR(ce->ce_mode) &&
>                     sparse_dir_matches_path(ce, info, p))
> -                       return ce;
> +                       break;
>
>                 pos--;
>         }
>
> -       return NULL;
> +       strbuf_release(&full_path);
> +       return ce;
>  }
>
>  static void debug_path(struct traverse_info *info)
> --
> gitgitgadget

Makes sense.

Derrick Stolee Aug. 20, 2021, 3:18 p.m. UTC | #3

On 8/19/2021 4:01 AM, Johannes Schindelin wrote:
> Hi Stolee,
> 
> On Tue, 17 Aug 2021, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The iterated search in find_cache_entry() was recently modified to
>> include a loop that searches backwards for a sparse directory entry that
>> matches the given traverse_info and name_entry. However, the string
>> comparison failed to actually concatenate those two strings, so this
>> failed to find a sparse directory when it was not a top-level directory.
>>
>> This caused some errors in rare cases where a 'git checkout' spanned a
>> diff that modified files within the sparse directory entry, but we could
>> not correctly find the entry.
> 
> Good explanation.
> 
> I wonder a bit about the performance impact. How "hot" is this function?
> I.e. how often is it called, on average?
> 
> I ask because I see opportunities to optimize in both directions: it could
> be written more concisely (if speed does not matter as much), and it could
> be made faster (if speed matters a lot). See below for more.

I would definitely optimize for speed here. This can be a very hot path,
I believe.

>> +	strbuf_addstr(&full_path, info->traverse_path);
>> +	strbuf_add(&full_path, p->path, p->pathlen);
>> +	strbuf_addch(&full_path, '/');
> 
> This could be reduced to:
> 
> 	strbuf_addf(&full_path, "%s%.*s/",
> 		    info->traverse_path, (int)p->pathlen, p->path);

We should definitely avoid formatted strings here, if possible.

> But if speed matters, we probably need something more like this:
> 
> 	size_t full_path_len;
> 	const char *full_path;
> 	char *full_path_1 = NULL;
> 
> 	if (!*info->traverse_path) {
> 		full_path = p->path;
> 		full_path_len = p->pathlen;
> 	} else {
> 		size_t len = strlen(info->traverse_path);
> 
> 		full_path_len = len + p->pathlen + 1;
> 		full_path = full_path_1 = xmalloc(full_path_len + 1);
> 		memcpy(full_path_1, info->traverse_path, len);
> 		memcpy(full_path_1 + len, p->path, p->pathlen);
> 		full_path_1[full_path_len - 1] = '/';
> 		full_path_1[full_path_len] = '\0';
> 	}

The critical benefit here is that we do not need to allocate a
buffer if the traverse_path does not exist. That might be a
worthwhile investment. That leads to justifying the use of
bare 'char *'s instead of 'struct strbuf'.

If the traverse_path is usually non-null, then we could continue using
strbufs as a helper and get the planned performance gains by using
strbuf_grow(&full_path, full_path_len + 1) followed by strbuf_add()
(instead of strbuf_addstr()). That would make this code a bit less
ugly with the only real overhead being the extra insertions of '\0'
characters as we add the strings to the strbuf().

I will need to investigate so see which one is the best.

Thanks,
-Stolee

René Scharfe Aug. 20, 2021, 7:35 p.m. UTC | #4

Am 20.08.21 um 17:18 schrieb Derrick Stolee:
> On 8/19/2021 4:01 AM, Johannes Schindelin wrote:
>> Hi Stolee,
>>
>> On Tue, 17 Aug 2021, Derrick Stolee via GitGitGadget wrote:
>>
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> The iterated search in find_cache_entry() was recently modified to
>>> include a loop that searches backwards for a sparse directory entry that
>>> matches the given traverse_info and name_entry. However, the string
>>> comparison failed to actually concatenate those two strings, so this
>>> failed to find a sparse directory when it was not a top-level directory.
>>>
>>> This caused some errors in rare cases where a 'git checkout' spanned a
>>> diff that modified files within the sparse directory entry, but we could
>>> not correctly find the entry.
>>
>> Good explanation.
>>
>> I wonder a bit about the performance impact. How "hot" is this function?
>> I.e. how often is it called, on average?
>>
>> I ask because I see opportunities to optimize in both directions: it could
>> be written more concisely (if speed does not matter as much), and it could
>> be made faster (if speed matters a lot). See below for more.
>
> I would definitely optimize for speed here. This can be a very hot path,
> I believe.
>
>>> +	strbuf_addstr(&full_path, info->traverse_path);
>>> +	strbuf_add(&full_path, p->path, p->pathlen);
>>> +	strbuf_addch(&full_path, '/');
>>
>> This could be reduced to:
>>
>> 	strbuf_addf(&full_path, "%s%.*s/",
>> 		    info->traverse_path, (int)p->pathlen, p->path);
>
> We should definitely avoid formatted strings here, if possible.
>
>> But if speed matters, we probably need something more like this:
>>
>> 	size_t full_path_len;
>> 	const char *full_path;
>> 	char *full_path_1 = NULL;
>>
>> 	if (!*info->traverse_path) {
>> 		full_path = p->path;
>> 		full_path_len = p->pathlen;
>> 	} else {
>> 		size_t len = strlen(info->traverse_path);
>>
>> 		full_path_len = len + p->pathlen + 1;
>> 		full_path = full_path_1 = xmalloc(full_path_len + 1);
>> 		memcpy(full_path_1, info->traverse_path, len);
>> 		memcpy(full_path_1 + len, p->path, p->pathlen);
>> 		full_path_1[full_path_len - 1] = '/';
>> 		full_path_1[full_path_len] = '\0';
>> 	}
>
> The critical benefit here is that we do not need to allocate a
> buffer if the traverse_path does not exist. That might be a
> worthwhile investment. That leads to justifying the use of
> bare 'char *'s instead of 'struct strbuf'.
>
> If the traverse_path is usually non-null, then we could continue using
> strbufs as a helper and get the planned performance gains by using
> strbuf_grow(&full_path, full_path_len + 1) followed by strbuf_add()
> (instead of strbuf_addstr()). That would make this code a bit less
> ugly with the only real overhead being the extra insertions of '\0'
> characters as we add the strings to the strbuf().

You create full_path only to compare it to another string.  You can
compare the pieces directly, without allocating and copying:

	const char *path;

	if (!skip_prefix(ce->name, info->traverse_path, &path) ||
	    strncmp(path, p->path, p->pathlen) ||
	    strcmp(path + p->pathlen, "/"))
		return NULL;

A test would be nice to demonstrate the fixed issue.

René

René Scharfe Aug. 20, 2021, 8:22 p.m. UTC | #5

Am 20.08.21 um 21:35 schrieb René Scharfe:
> Am 20.08.21 um 17:18 schrieb Derrick Stolee:
>> On 8/19/2021 4:01 AM, Johannes Schindelin wrote:
>>> Hi Stolee,
>>>
>>> On Tue, 17 Aug 2021, Derrick Stolee via GitGitGadget wrote:
>>>
>>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>>
>>>> The iterated search in find_cache_entry() was recently modified to
>>>> include a loop that searches backwards for a sparse directory entry that
>>>> matches the given traverse_info and name_entry. However, the string
>>>> comparison failed to actually concatenate those two strings, so this
>>>> failed to find a sparse directory when it was not a top-level directory.
>>>>
>>>> This caused some errors in rare cases where a 'git checkout' spanned a
>>>> diff that modified files within the sparse directory entry, but we could
>>>> not correctly find the entry.
>>>
>>> Good explanation.
>>>
>>> I wonder a bit about the performance impact. How "hot" is this function?
>>> I.e. how often is it called, on average?
>>>
>>> I ask because I see opportunities to optimize in both directions: it could
>>> be written more concisely (if speed does not matter as much), and it could
>>> be made faster (if speed matters a lot). See below for more.
>>
>> I would definitely optimize for speed here. This can be a very hot path,
>> I believe.
>>
>>>> +	strbuf_addstr(&full_path, info->traverse_path);
>>>> +	strbuf_add(&full_path, p->path, p->pathlen);
>>>> +	strbuf_addch(&full_path, '/');
>>>
>>> This could be reduced to:
>>>
>>> 	strbuf_addf(&full_path, "%s%.*s/",
>>> 		    info->traverse_path, (int)p->pathlen, p->path);
>>
>> We should definitely avoid formatted strings here, if possible.
>>
>>> But if speed matters, we probably need something more like this:
>>>
>>> 	size_t full_path_len;
>>> 	const char *full_path;
>>> 	char *full_path_1 = NULL;
>>>
>>> 	if (!*info->traverse_path) {
>>> 		full_path = p->path;
>>> 		full_path_len = p->pathlen;
>>> 	} else {
>>> 		size_t len = strlen(info->traverse_path);
>>>
>>> 		full_path_len = len + p->pathlen + 1;
>>> 		full_path = full_path_1 = xmalloc(full_path_len + 1);
>>> 		memcpy(full_path_1, info->traverse_path, len);
>>> 		memcpy(full_path_1 + len, p->path, p->pathlen);
>>> 		full_path_1[full_path_len - 1] = '/';
>>> 		full_path_1[full_path_len] = '\0';
>>> 	}
>>
>> The critical benefit here is that we do not need to allocate a
>> buffer if the traverse_path does not exist. That might be a
>> worthwhile investment. That leads to justifying the use of
>> bare 'char *'s instead of 'struct strbuf'.
>>
>> If the traverse_path is usually non-null, then we could continue using
>> strbufs as a helper and get the planned performance gains by using
>> strbuf_grow(&full_path, full_path_len + 1) followed by strbuf_add()
>> (instead of strbuf_addstr()). That would make this code a bit less
>> ugly with the only real overhead being the extra insertions of '\0'
>> characters as we add the strings to the strbuf().
>
> You create full_path only to compare it to another string.  You can
> compare the pieces directly, without allocating and copying:
>
> 	const char *path;
>
> 	if (!skip_prefix(ce->name, info->traverse_path, &path) ||
> 	    strncmp(path, p->path, p->pathlen) ||
> 	    strcmp(path + p->pathlen, "/"))

The strcmp line is wrong (should be path[p->pathlen] != '/'), but
you get the idea..

> 		return NULL;
>
> A test would be nice to demonstrate the fixed issue.
>
> René
>

[v3,4/8] unpack-trees: fix nested sparse-dir search

Commit Message

Comments

Patch