[v3,02/20] t/perf: add performance test for sparse operations

Message ID d2197e895e4d4160fa369e2ba7d82e2e5a7fbc01.1615912983.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series Sparse Index: Design, Format, Tests

Commit Message

Derrick Stolee March 16, 2021, 4:42 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

Create a test script that takes the default performance test (the Git
codebase) and multiplies it by 256 using four layers of duplicated
trees of width four. This results in nearly one million blob entries in
the index. Then, we can clone this repository with sparse-checkout
patterns that demonstrate four copies of the initial repository. Each
clone will use a different index format or mode so performance can be
tested across the different options.
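
The "nearly one million" figure follows from the tree construction in the
script: each of the four levels wraps the previous tree in four
subdirectories (f1 to f4) and adds one extra blob "a". A minimal sketch of
that arithmetic, assuming a hypothetical helper name and a round base count
of 4000 blobs (an illustrative number, not a measurement of git.git):

```shell
#!/bin/sh

# Hypothetical helper (not part of the patch): count the blob entries
# after wrapping a tree LEVELS times, where each level holds WIDTH
# copies of the previous tree plus one extra blob ("a").
# Usage: entries_after_levels BASE LEVELS WIDTH
entries_after_levels () {
	entries=$1 levels=$2 width=$3
	i=0
	while test "$i" -lt "$levels"
	do
		entries=$((width * entries + 1))
		i=$((i + 1))
	done
	echo "$entries"
}

# Assumed base of 4000 blobs, four levels of width four:
# 256 * 4000 copies plus 1 + 4 + 16 + 64 extra "a" blobs.
entries_after_levels 4000 4 4	# prints 1024085
```

The base count should include the "a" and "b" files added in the "level 0"
commit, since they are part of the tree that gets duplicated.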

Note that the initial repo is stripped of submodules before doing the
copies. This preserves the expected data shape of the sparse index,
because directories containing submodules are not collapsed to a sparse
directory entry.

Run a few Git commands on these clones, especially those that use the
index (status, add, commit).

Here are the results on my Linux machine:

Test
--------------------------------------------------------------
2000.2: git status (full-index-v3)             0.37(0.30+0.09)
2000.3: git status (full-index-v4)             0.39(0.32+0.10)
2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)

It is perhaps noteworthy that there is an improvement when using index
version 4. This is because the v3 index uses 108 MiB while the v4
index uses 80 MiB. Since the repeated portions of the directories are
very short (f3/f1/f2, for example) this ratio is less pronounced than in
similarly-sized real repositories.
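
To illustrate why short repeated prefixes limit the gain: index version 4
stores each path relative to the previous one, as a count of bytes to strip
from the end of the previous path plus the suffix to append. The sketch
below mimics that idea on plain text. It is a simplified illustration under
that assumption, not Git's on-disk encoding, and the helper name is
hypothetical.

```shell
#!/bin/sh

# Hypothetical helper: read sorted paths on stdin and print, for each
# path, "<bytes to strip from previous path> <suffix to append>".
# This only imitates the prefix-compression idea used by index v4.
prefix_compress () {
	awk '
	{
		cur = $0
		# Length of the common prefix shared with the previous path.
		n = 0
		while (n < length(prev) && n < length(cur) &&
		       substr(prev, n + 1, 1) == substr(cur, n + 1, 1))
			n++
		printf "%d %s\n", length(prev) - n, substr(cur, n + 1)
		prev = cur
	}'
}

# Sibling paths in the duplicated tree share only a short prefix such
# as "f3/f1/f2/", so each entry saves just a few bytes.
printf 'f3/f1/f2/a\nf3/f1/f2/b\nf3/f1/f3/a\n' | prefix_compress
```

With deeper, longer directory names, as in many real repositories, the
shared prefix is a larger fraction of each path, which is consistent with
the expectation above that real repositories benefit more.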

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 85 +++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100755 t/perf/p2000-sparse-operations.sh

Comments

Ævar Arnfjörð Bjarmason March 17, 2021, 8:41 a.m. UTC | #1
On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Create a test script that takes the default performance test (the Git
> codebase) and multiplies it by 256 using four layers of duplicated
> trees of width four. This results in nearly one million blob entries in
> the index. Then, we can clone this repository with sparse-checkout
> patterns that demonstrate four copies of the initial repository. Each
> clone will use a different index format or mode so performance can be
> tested across the different options.
>
> Note that the initial repo is stripped of submodules before doing the
> copies. This preserves the expected data shape of the sparse index,
> because directories containing submodules are not collapsed to a sparse
> directory entry.
>
> Run a few Git commands on these clones, especially those that use the
> index (status, add, commit).
>
> Here are the results on my Linux machine:
>
> Test
> --------------------------------------------------------------
> 2000.2: git status (full-index-v3)             0.37(0.30+0.09)
> 2000.3: git status (full-index-v4)             0.39(0.32+0.10)
> 2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
> 2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
> 2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
> 2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
> 2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
> 2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)
>
> It is perhaps noteworthy that there is an improvement when using index
> version 4. This is because the v3 index uses 108 MiB while the v4
> index uses 80 MiB. Since the repeated portions of the directories are
> very short (f3/f1/f2, for example) this ratio is less pronounced than in
> similarly-sized real repositories.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/perf/p2000-sparse-operations.sh | 85 +++++++++++++++++++++++++++++++
>  1 file changed, 85 insertions(+)
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> new file mode 100755
> index 000000000000..2fbc81b22119
> --- /dev/null
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -0,0 +1,85 @@
> +#!/bin/sh
> +
> +test_description="test performance of Git operations using the index"
> +
> +. ./perf-lib.sh
> +
> +test_perf_default_repo
> +
> +SPARSE_CONE=f2/f4/f1
> +
> +test_expect_success 'setup repo and indexes' '
> +	git reset --hard HEAD &&
> +	# Remove submodules from the example repo, because our
> +	# duplication of the entire repo creates an unlikly data shape.
> +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
> +	git rm -f .gitmodules &&
> +	for module in $(awk "{print \$2}" modules)
> +	do
> +		git rm $module || return 1
> +	done &&
> +	git commit -m "remove submodules" &&

Paradoxically with this you can no longer use a repo that's not git.git
or another repo that has submodules, since we'll die in trying to remove
them.

Also you don't have to "git rm .gitmodules", the "git rm" command
removes submodule entries.

Perhaps just:

    for module in $(git ls-files --stage | grep ^160000 | awk -F '\t' '{ print $2 }')
    do
        git rm "$module"
    done

Or another way of guarding against rm getting the empty list && commit?
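
One possible shape for such a guard, as a rough sketch rather than a tested
replacement for the setup code: collect the gitlink (mode 160000) paths
first, and only run "git rm" and "git commit" when the list is non-empty.
The function names are hypothetical, and the unquoted expansion assumes
submodule paths without whitespace.

```shell
#!/bin/sh

# Hypothetical helper: read "git ls-files --stage" output on stdin
# ("<mode> <object> <stage><TAB><path>") and print gitlink paths.
gitlink_paths () {
	awk -F '\t' '$1 ~ /^160000 / { print $2 }'
}

# Hypothetical helper: remove all submodules and commit, but do
# nothing (and succeed) when the repository has none.
remove_submodules () {
	modules=$(git ls-files --stage | gitlink_paths)
	if test -n "$modules"
	then
		# Unquoted on purpose: one argument per path.
		# This breaks for paths that contain whitespace.
		git rm -q $modules &&
		git commit -q -m "remove submodules"
	fi
}
```

Called as "remove_submodules &&" inside the setup test, this would skip
both the removal and the commit for repositories without submodules.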

But it seems odd to be doing this at all, the point of the perf
framework is that you can point it at any repo, and some repos you want
to test will have submodules.

Seems like something like the WIP patch at the end on top would be
better.

> +	echo bogus >a &&
> +	cp a b &&
> +	git add a b &&
> +	git commit -m "level 0" &&
> +	BLOB=$(git rev-parse HEAD:a) &&

Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
git hash-object --stdin -w'? Why commit it?

> +	OLD_COMMIT=$(git rev-parse HEAD) &&
> +	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
> +
> +	for i in $(test_seq 1 4)
> +	do
> +		cat >in <<-EOF &&
> +			100755 blob $BLOB	a
> +			040000 tree $OLD_TREE	f1
> +			040000 tree $OLD_TREE	f2
> +			040000 tree $OLD_TREE	f3
> +			040000 tree $OLD_TREE	f4
> +		EOF
> +		NEW_TREE=$(git mktree <in) &&
> +		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
> +		OLD_TREE=$NEW_TREE &&
> +		OLD_COMMIT=$NEW_COMMIT || return 1
> +	done &&
> +
> +	git sparse-checkout init --cone &&
> +	git branch -f wide $OLD_COMMIT &&
> +	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
> +	(
> +		cd full-index-v3 &&
> +		git sparse-checkout init --cone &&
> +		git sparse-checkout set $SPARSE_CONE &&
> +		git config index.version 3 &&
> +		git update-index --index-version=3
> +	) &&
> +	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
> +	(
> +		cd full-index-v4 &&
> +		git sparse-checkout init --cone &&
> +		git sparse-checkout set $SPARSE_CONE &&
> +		git config index.version 4 &&
> +		git update-index --index-version=4
> +	)
> +'

This whole thing makes me think you just wanted a test_perf_fresh_repo
all along, but I think this would be much more useful if you took the
default repo and multiplied the size in its tree by some multiple.

E.g. take the files we have in git.git, write a copy at prefix-1/,
prefix-2/ etc.

The whole point of test_perf_{default,large}_repo is being able to point
them at a local repo you're testing for performance and get numbers
representative of that repo.

So maybe that's not what's wanted here at all, but that brings us back
to test_perf_fresh_repo...

> +test_perf_on_all () {
> +	command="$@"
> +	for repo in full-index-v3 full-index-v4
> +	do
> +		test_perf "$command ($repo)" "
> +			(
> +				cd $repo &&
> +				echo >>$SPARSE_CONE/a &&
> +				$command
> +			)
> +		"
> +	done
> +}
> +
> +test_perf_on_all git status
> +test_perf_on_all git add -A
> +test_perf_on_all git add .
> +test_perf_on_all git commit -a -m A
> +
> +test_done

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index e527316e66..2c07b04159 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -4,22 +4,11 @@ test_description="test performance of Git operations using the index"
 
 . ./perf-lib.sh
 
-test_perf_default_repo
+test_perf_nosubodules_repo
 
 SPARSE_CONE=f2/f4/f1
 
 test_expect_success 'setup repo and indexes' '
-	git reset --hard HEAD &&
-	# Remove submodules from the example repo, because our
-	# duplication of the entire repo creates an unlikly data shape.
-	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
-	git rm -f .gitmodules &&
-	for module in $(awk "{print \$2}" modules)
-	do
-		git rm $module || return 1
-	done &&
-	git commit -m "remove submodules" &&
-
 	echo bogus >a &&
 	cp a b &&
 	git add a b &&
diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index e385c6896f..86b716ce8f 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -128,6 +128,15 @@ test_perf_large_repo () {
 	fi
 	test_perf_create_repo_from "${1:-$TRASH_DIRECTORY}" "$GIT_PERF_LARGE_REPO"
 }
+test_perf_nosubodules_repo () {
+	if test "$GIT_PERF_NOSUBMODULES_REPO" = "$GIT_BUILD_DIR"; then
+		echo "warning: \$GIT_PERF_NOSUBMODULES_REPO is \$GIT_BUILD_DIR." >&2
+		echo "warning: This will probably work, but it has a submodule!" >&2
+		echo "warning: point to another repo for representative measurements." >&2
+		# git rm dance here? optionally?
+	fi
+	test_perf_create_repo_from "${1:-$TRASH_DIRECTORY}" "$GIT_PERF_NOSUBMODULES_REPO"
+}
 test_checkout_worktree () {
 	git checkout-index -u -a ||
 	error "git checkout-index failed"
@@ -196,7 +205,7 @@ test_perf_ () {
 	else
 		echo "perf $test_count - $1:"
 	fi
-	for i in $(test_seq 1 $GIT_PERF_REPEAT_COUNT); do
+	for i in $(test_seq 1 $GIT_PERF_REP
 		say >&3 "running: $2"
 		if test_run_perf_ "$2"
 		then
Derrick Stolee March 17, 2021, 1:05 p.m. UTC | #2
On 3/17/2021 4:41 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:
>> +test_expect_success 'setup repo and indexes' '
>> +	git reset --hard HEAD &&
>> +	# Remove submodules from the example repo, because our
>> +	# duplication of the entire repo creates an unlikly data shape.
>> +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>> +	git rm -f .gitmodules &&
>> +	for module in $(awk "{print \$2}" modules)
>> +	do
>> +		git rm $module || return 1
>> +	done &&
>> +	git commit -m "remove submodules" &&
> 
> Paradoxically with this you can no longer use a repo that's not git.git
> or another repo that has submodules, since we'll die in trying to remove
> them.

Good point.

> Also you don't have to "git rm .gitmodules", the "git rm" command
> removes submodule entries.

Sure.

> Perhaps just:
> 
>     for module in $(git ls-files --stage | grep ^160000 | awk -F '\t' '{ print $2 }')
>     do
>         git rm "$module"
>     done
> 
> Or another way of guarding against rm getting the empty list && commit?
> 
> But it seems odd to be doing this at all, the point of the perf
> framework is that you can point it at any repo, and some repos you want
> to test will have submodules.

You're right that it should handle all repos. However, the point of
the test is to have many copies of the repo, but most of them are
excluded by sparse-directory entries. We don't collapse sparse-directory
entries if there is a submodule inside, so the data shape is wrong after
making all the copies.

So, I disagree with your approach in your suggested diff, and instead
offer this one. I've tested this with git.git and another local repo
without submodules and checked that everything works as expected.

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index e527316e66d..5c0d78eeeea 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -10,15 +10,17 @@ SPARSE_CONE=f2/f4/f1
 
 test_expect_success 'setup repo and indexes' '
 	git reset --hard HEAD &&
+
 	# Remove submodules from the example repo, because our
-	# duplication of the entire repo creates an unlikly data shape.
-	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
-	git rm -f .gitmodules &&
-	for module in $(awk "{print \$2}" modules)
-	do
-		git rm $module || return 1
-	done &&
-	git commit -m "remove submodules" &&
+	# duplication of the entire repo creates an unlikely data shape.
+	if (git config --file .gitmodules --get-regexp "submodule.*.path" >modules)
+	then
+		for module in $(awk "{print \$2}" modules)
+		do
+			git rm $module || return 1
+		done &&
+		git commit -m "remove submodules" || return 1
+	fi &&
 
 	echo bogus >a &&
 	cp a b &&

> Seems like something like the WIP patch at the end on top would be
> better.
> 
>> +	echo bogus >a &&
>> +	cp a b &&
>> +	git add a b &&
>> +	git commit -m "level 0" &&
>> +	BLOB=$(git rev-parse HEAD:a) &&
> 
> Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
> git hash-object --stdin -w' why commit it?

We are committing it so we can add commits that deepen the copies,
but within those copies we have these known file paths.

> This whole thing makes me think you just wanted a test_perf_fresh_repo
> all along, but I think this would be much more useful if you took the
> default repo and multiplied the size in its tree by some multiple.
> 
> E.g. take the files we have in git.git, write a copy at prefix-1/,
> prefix-2/ etc.

That is essentially what is happening here, but using multiple levels
of directories. Using these multiple levels presents extra tree
lookups and parsing in the event of expanding a sparse index to a
full one.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason March 17, 2021, 1:21 p.m. UTC | #3
On Wed, Mar 17 2021, Derrick Stolee wrote:

> On 3/17/2021 4:41 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:
>>> +test_expect_success 'setup repo and indexes' '
>>> +	git reset --hard HEAD &&
>>> +	# Remove submodules from the example repo, because our
>>> +	# duplication of the entire repo creates an unlikly data shape.
>>> +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>>> +	git rm -f .gitmodules &&
>>> +	for module in $(awk "{print \$2}" modules)
>>> +	do
>>> +		git rm $module || return 1
>>> +	done &&
>>> +	git commit -m "remove submodules" &&
>> 
>> Paradoxically with this you can no longer use a repo that's not git.git
>> or another repo that has submodules, since we'll die in trying to remove
>> them.
>
> Good point.
>
>> Also you don't have to "git rm .gitmodules", the "git rm" command
>> removes submodule entries.
>
> Sure.
>
>> Perhaps just:
>> 
>>     for module in $(git ls-files --stage | grep ^160000 | awk -F '\t' '{ print $2 }')
>>     do
>>         git rm "$module"
>>     done
>> 
>> Or another way of guarding against rm getting the empty list && commit?
>> 
>> But it seems odd to be doing this at all, the point of the perf
>> framework is that you can point it at any repo, and some repos you want
>> to test will have submodules.
>
> You're right that it should handle all repos. However, the point of
> the test is to have many copies of the repo, but most of them are
> excluded by sparse-directory entries. We don't collapse sparse-directory
> entries if there is a submodule inside, so the data shape is wrong after
> making all the copies.
>
> So, I disagree with your approach in your suggested diff, and instead
> offer this one. I've tested this with git.git and another local repo
> without submodules and checked that everything works as expected.

What's got me confused here is that there's two uses for the perf
framework in this context.

It's to use an empty/git.git as a test repo to demonstrate something,
but then also that you can run it in your arbitrary repo, and e.g. see
how much a given feature might benefit you.

Hence suggesting that maybe test_perf_fresh_repo is better here, because
by using test_perf_default_repo you're creating the expectation that you
can run the perf test, observe an %X difference, and that'll be
give-or-take what you'll get for that use case if you enable the feature.

Except it won't because the repo has submodules, which we deleted for
the perf test...

> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> index e527316e66d..5c0d78eeeea 100755
> --- a/t/perf/p2000-sparse-operations.sh
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -10,15 +10,17 @@ SPARSE_CONE=f2/f4/f1
>  
>  test_expect_success 'setup repo and indexes' '
>  	git reset --hard HEAD &&
> +
>  	# Remove submodules from the example repo, because our
> -	# duplication of the entire repo creates an unlikly data shape.
> -	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
> -	git rm -f .gitmodules &&
> -	for module in $(awk "{print \$2}" modules)
> -	do
> -		git rm $module || return 1
> -	done &&
> -	git commit -m "remove submodules" &&
> +	# duplication of the entire repo creates an unlikely data shape.
> +	if (git config --file .gitmodules --get-regexp "submodule.*.path" >modules)

A subshell isn't needed here.

FWIW the reason I got this out of ls-files is because you can have
submodules without .gitmodules entries, rare and broken, but seemed more
direct to grep the mode bits.

> +	then
> +		for module in $(awk "{print \$2}" modules)
> +		do
> +			git rm $module || return 1
> +		done &&

Once we know we have submodules we can just do this without the loop.

    git rm $(awk "{print \$2}" modules)



> +		git commit -m "remove submodules" || return 1
> +	fi &&
>  
>  	echo bogus >a &&
>  	cp a b &&
>
>> Seems like something like the WIP patch at the end on top would be
>> better.
>> 
>>> +	echo bogus >a &&
>>> +	cp a b &&
>>> +	git add a b &&
>>> +	git commit -m "level 0" &&
>>> +	BLOB=$(git rev-parse HEAD:a) &&
>> 
>> Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
>> git hash-object --stdin -w' why commit it?
>
> We are committing it so we can add commits that deepen the copies,
> but within those copies we have these known file paths.
>
>> This whole thing makes me think you just wanted a test_perf_fresh_repo
>> all along, but I think this would be much more useful if you took the
>> default repo and multiplied the size in its tree by some multiple.
>> 
>> E.g. take the files we have in git.git, write a copy at prefix-1/,
>> prefix-2/ etc.
>
> That is essentially what is happening here, but using multiple levels
> of directories. Using these multiple levels presents extra tree
> lookups and parsing in the event of expanding a sparse index to a
> full one.

*nod*

Anyway, this thread's a bit of a bikeshed on my part, I was just
wondering if & what part of the test relied on the existing repo if it
was mostly setting up its own test data.
Derrick Stolee March 17, 2021, 6:02 p.m. UTC | #4
On 3/17/2021 9:21 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Mar 17 2021, Derrick Stolee wrote:
> 
>> On 3/17/2021 4:41 AM, Ævar Arnfjörð Bjarmason wrote:
>>> But it seems odd to be doing this at all, the point of the perf
>>> framework is that you can point it at any repo, and some repos you want
>>> to test will have submodules.
>>
>> You're right that it should handle all repos. However, the point of
>> the test is to have many copies of the repo, but most of them are
>> excluded by sparse-directory entries. We don't collapse sparse-directory
>> entries if there is a submodule inside, so the data shape is wrong after
>> making all the copies.
>>
>> So, I disagree with your approach in your suggested diff, and instead
>> offer this one. I've tested this with git.git and another local repo
>> without submodules and checked that everything works as expected.
> 
> What's got me confused here is that there's two uses for the perf
> framework in this context.
> 
> It's to use an empty/git.git as a test repo to demonstrate something,
> but then also that you can run it in your arbitrary repo, and e.g. see
> how much a given feature might benefit you.
> 
> Hence suggesting that maybe test_perf_fresh_repo is better here, because
> by using test_perf_default_repo you're creating the expectation that you
> can run the perf test, observe an %X difference, and that'll be
> give-or-take what you'll get for that use case if you enable the feature.
> 
> Except it won't because the repo has submodules, which we deleted for
> the perf test...

I'm also dramatically changing the repository shape to expose index
reads and writes as a bottleneck. The benefit of using other repos
(like git.git or optionally choosing the Linux kernel repo) is to
change how much of the time is spent crawling the populated set.

>> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
>> index e527316e66d..5c0d78eeeea 100755
>> --- a/t/perf/p2000-sparse-operations.sh
>> +++ b/t/perf/p2000-sparse-operations.sh
>> @@ -10,15 +10,17 @@ SPARSE_CONE=f2/f4/f1
>>  
>>  test_expect_success 'setup repo and indexes' '
>>  	git reset --hard HEAD &&
>> +
>>  	# Remove submodules from the example repo, because our
>> -	# duplication of the entire repo creates an unlikly data shape.
>> -	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>> -	git rm -f .gitmodules &&
>> -	for module in $(awk "{print \$2}" modules)
>> -	do
>> -		git rm $module || return 1
>> -	done &&
>> -	git commit -m "remove submodules" &&
>> +	# duplication of the entire repo creates an unlikely data shape.
>> +	if (git config --file .gitmodules --get-regexp "submodule.*.path" >modules)
> 
> A subshell isn't needed here.
> 
> FWIW the reason I got this out of ls-files is because you can have
> submodules without .gitmodules entries, rare and broken, but seemed more
> direct to grep the mode bits.

I'd prefer to do something (textually) simpler, expecting the input
repos to have correct data.

>> +	then
>> +		for module in $(awk "{print \$2}" modules)
>> +		do
>> +			git rm $module || return 1
>> +		done &&
> 
> Once we know we have submodules we can just do this without the loop.
> 
>     git rm $(awk "{print \$2}" modules)

Ok. That works for me.
>>> Seems like something like the WIP patch at the end on top would be
>>> better.
>>>
>>>> +	echo bogus >a &&
>>>> +	cp a b &&
>>>> +	git add a b &&
>>>> +	git commit -m "level 0" &&
>>>> +	BLOB=$(git rev-parse HEAD:a) &&
>>>
>>> Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
>>> git hash-object --stdin -w' why commit it?
>>
>> We are committing it so we can add commits that deepen the copies,
>> but within those copies we have these known file paths.
>>
>>> This whole thing makes me think you just wanted a test_perf_fresh_repo
>>> all along, but I think this would be much more useful if you took the
>>> default repo and multiplied the size in its tree by some multiple.
>>>
>>> E.g. take the files we have in git.git, write a copy at prefix-1/,
>>> prefix-2/ etc.
>>
>> That is essentially what is happening here, but using multiple levels
>> of directories. Using these multiple levels presents extra tree
>> lookups and parsing in the event of expanding a sparse index to a
>> full one.
> 
> *nod*
> 
> Anyway, this thread's a bit of a bikeshed on my part, I was just
> wondering if & what part of the test relied on the existing repo if it
> was mostly setting up its own test data.

Again, the benefit is to depend on the repo shape in some aspects,
while exaggerating the data shape to make the non-populated set
extremely large.

This presents different aspects that are worth examining, such as
git.git is much smaller than linux.git, and that is noticeable with
these different performance numbers (taken at the end of this
series):

git.git
Test                                            this tree      
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.39(0.35+0.08)
2000.3: git status (full-index-v4)              0.39(0.34+0.09)
2000.4: git status (sparse-index-v3)            2.46(2.33+0.16)
2000.5: git status (sparse-index-v4)            2.42(2.31+0.15)
2000.6: git add -A (full-index-v3)              1.35(0.98+0.20)
2000.7: git add -A (full-index-v4)              1.25(0.96+0.18)
2000.8: git add -A (sparse-index-v3)            2.39(2.26+0.17)
2000.9: git add -A (sparse-index-v4)            2.35(2.29+0.11)
2000.10: git add . (full-index-v3)              1.39(1.01+0.19)
2000.11: git add . (full-index-v4)              1.31(1.00+0.19)
2000.12: git add . (sparse-index-v3)            2.41(2.28+0.16)
2000.13: git add . (sparse-index-v4)            2.45(2.32+0.16)
2000.14: git commit -a -m A (full-index-v3)     1.44(1.08+0.21)
2000.15: git commit -a -m A (full-index-v4)     1.31(1.04+0.19)
2000.16: git commit -a -m A (sparse-index-v3)   2.44(2.35+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   2.44(2.36+0.16)

linux.git
Test                                            this tree        
-----------------------------------------------------------------
2000.2: git status (full-index-v3)              7.14(6.06+1.79)  
2000.3: git status (full-index-v4)              7.01(6.16+1.60)  
2000.4: git status (sparse-index-v3)            58.50(56.86+2.34)
2000.5: git status (sparse-index-v4)            57.52(55.80+2.45)
2000.6: git add -A (full-index-v3)              25.52(18.70+3.18)
2000.7: git add -A (full-index-v4)              22.26(17.52+2.72)
2000.8: git add -A (sparse-index-v3)            56.65(55.00+2.35)
2000.9: git add -A (sparse-index-v4)            56.56(54.98+2.29)
2000.10: git add . (full-index-v3)              25.87(19.12+3.15)
2000.11: git add . (full-index-v4)              22.56(17.85+2.71)
2000.12: git add . (sparse-index-v3)            57.01(55.28+2.42)
2000.13: git add . (sparse-index-v4)            56.84(55.38+2.19)
2000.14: git commit -a -m A (full-index-v3)     26.83(20.69+3.24)
2000.15: git commit -a -m A (full-index-v4)     24.04(19.86+2.65)
2000.16: git commit -a -m A (sparse-index-v3)   60.23(58.99+2.44)
2000.17: git commit -a -m A (sparse-index-v4)   60.52(59.09+2.74)

The intention is to make these numbers improve in the future
so that the sparse-index is a better approach.
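
For reference, the gap between the two modes can be read off the elapsed
times in the tables above. A small sketch of that arithmetic, using a
hypothetical helper and the "git status" v3 rows as input:

```shell
#!/bin/sh

# Hypothetical helper: ratio of two elapsed times, to one decimal.
# Usage: slowdown SPARSE_SECONDS FULL_SECONDS
slowdown () {
	LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# git status, sparse-index-v3 vs full-index-v3
slowdown 2.46 0.39	# git.git:   prints 6.3
slowdown 58.50 7.14	# linux.git: prints 8.2
```

These ratios describe the state at the end of this series, where the
sparse index is still slower for these commands.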

Thanks,
-Stolee

Patch

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
new file mode 100755
index 000000000000..2fbc81b22119
--- /dev/null
+++ b/t/perf/p2000-sparse-operations.sh
@@ -0,0 +1,85 @@ 
+#!/bin/sh
+
+test_description="test performance of Git operations using the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+SPARSE_CONE=f2/f4/f1
+
+test_expect_success 'setup repo and indexes' '
+	git reset --hard HEAD &&
+	# Remove submodules from the example repo, because our
+	# duplication of the entire repo creates an unlikly data shape.
+	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
+	git rm -f .gitmodules &&
+	for module in $(awk "{print \$2}" modules)
+	do
+		git rm $module || return 1
+	done &&
+	git commit -m "remove submodules" &&
+
+	echo bogus >a &&
+	cp a b &&
+	git add a b &&
+	git commit -m "level 0" &&
+	BLOB=$(git rev-parse HEAD:a) &&
+	OLD_COMMIT=$(git rev-parse HEAD) &&
+	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
+
+	for i in $(test_seq 1 4)
+	do
+		cat >in <<-EOF &&
+			100755 blob $BLOB	a
+			040000 tree $OLD_TREE	f1
+			040000 tree $OLD_TREE	f2
+			040000 tree $OLD_TREE	f3
+			040000 tree $OLD_TREE	f4
+		EOF
+		NEW_TREE=$(git mktree <in) &&
+		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
+		OLD_TREE=$NEW_TREE &&
+		OLD_COMMIT=$NEW_COMMIT || return 1
+	done &&
+
+	git sparse-checkout init --cone &&
+	git branch -f wide $OLD_COMMIT &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
+	(
+		cd full-index-v3 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
+	(
+		cd full-index-v4 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
+	)
+'
+
+test_perf_on_all () {
+	command="$@"
+	for repo in full-index-v3 full-index-v4
+	do
+		test_perf "$command ($repo)" "
+			(
+				cd $repo &&
+				echo >>$SPARSE_CONE/a &&
+				$command
+			)
+		"
+	done
+}
+
+test_perf_on_all git status
+test_perf_on_all git add -A
+test_perf_on_all git add .
+test_perf_on_all git commit -a -m A
+
+test_done