diff mbox series

[v5,22/26] t7700: consolidate code into test_no_missing_in_packs()

Message ID a99a45cb6f2cc7328ef0e52fc2ea8fec537bfba9.1574884302.git.liu.denton@gmail.com (mailing list archive)
State New, archived
Headers show
Series t: test cleanup stemming from experimentally enabling pipefail | expand

Commit Message

Denton Liu Nov. 27, 2019, 7:53 p.m. UTC
The code to test that objects were not missing from the packfile was
duplicated many times. Extract the duplicated code into
test_no_missing_in_packs() and use that instead.

Refactor the resulting extraction so that if any git commands fail,
their return codes are not silently lost.

We were using sed to filter lines. Although not incorrect, this is
exactly what grep is built for. Replace this invocation of sed with grep
so that we use the correct tool for the job.

Instead of verifying each file of `alt_objects/pack/*.idx` individually
in a for-loop, batch them together into one verification step.

The original testing construct was O(n^2): it used a grep in a loop to
test whether any objects were missing in the packfile. Rewrite this to
sort the files then use `comm -23` so that finding missing lines from
the original file is done more efficiently.

While we're at it, add a space to `commit_and_pack ()` for style.

Signed-off-by: Denton Liu <liu.denton@gmail.com>
---
 t/t7700-repack.sh | 55 +++++++++++++----------------------------------
 1 file changed, 15 insertions(+), 40 deletions(-)

Comments

Junio C Hamano Nov. 29, 2019, 9:39 p.m. UTC | #1
Denton Liu <liu.denton@gmail.com> writes:

> The code to test that objects were not missing from the packfile was
> duplicated many times. Extract the duplicated code into
> test_no_missing_in_packs() and use that instead.
>
> Refactor the resulting extraction so that if any git commands fail,
> their return codes are not silently lost.
>
> We were using sed to filter lines. Although not incorrect, this is
> exactly what grep is built for. Replace this invocation of sed with grep
> so that we use the correct tool for the job.

Well, 

    $ sed -n -e 's/required match/desired part of the line/p'

is much much more approirate than 

    $ grep -e "requred match" |
      extract desired part of the line

"grep" is better only if the original were

    $ sed -n -e '/required match/p'

but everybody would write it with grep to begin with ;-)

So, I dunno about this part of the conversion.

> Instead of verifying each file of `alt_objects/pack/*.idx` individually
> in a for-loop, batch them together into one verification step.

Do you mean this one?

	git verify-pack -v alt_objects/pack/*.idx

where we may pass 1 or more .idx file to the command?  At first my
reading was interrupted by a "Huh?", but that does look good.  We'd
need to be a bit careful to make sure that we have at least 1 .idx
file, as the shell will happily feed a file whose name is "*.idx",
which verify-pack would be unhappy about.

> The original testing construct was O(n^2): it used a grep in a loop to
> test whether any objects were missing in the packfile. Rewrite this to
> sort the files then use `comm -23` so that finding missing lines from
> the original file is done more efficiently.

OK.  If we an show measurable speedups, it would be great, but the
loop structure does look O(n^2) and unnecessary costly.

> +test_no_missing_in_packs () {
> +	myidx=$(ls -1 .git/objects/pack/*.idx) &&
> +	test_path_is_file "$myidx" &&

If there are 2 or more .idx files, or if there is none, $myidx would
hopefully be a concatenation of these filenames or a string that
ends with asterisk-dot-idx and would fail path_is_file.  Sounds OK.

Ah, I do not have to review this part---these are repeated patterns
in the original.

> +	git verify-pack -v alt_objects/pack/*.idx >orig.raw &&
> +	grep "^[0-9a-f]\{40\}" orig.raw | cut -d" " -f1 | sort >orig &&

If output from 'grep' can be used as-is, it is worth doing, but if
you have to pipe it to cut, the original that used sed to filter and
edit the line would probably be a better way to write it.

> +	git verify-pack -v $myidx >dest.raw &&

This part does not quote $myidx" (inherited from the original); it
probably is OK, as any potentially problematic value in $myidx would
have been caught as an error much earlier in this test.
Denton Liu Dec. 2, 2019, 8:50 p.m. UTC | #2
Hi Junio,

On Fri, Nov 29, 2019 at 01:39:30PM -0800, Junio C Hamano wrote:
> Denton Liu <liu.denton@gmail.com> writes:
> 
> > The code to test that objects were not missing from the packfile was
> > duplicated many times. Extract the duplicated code into
> > test_no_missing_in_packs() and use that instead.
> >
> > Refactor the resulting extraction so that if any git commands fail,
> > their return codes are not silently lost.
> >
> > We were using sed to filter lines. Although not incorrect, this is
> > exactly what grep is built for. Replace this invocation of sed with grep
> > so that we use the correct tool for the job.
> 
> Well, 
> 
>     $ sed -n -e 's/required match/desired part of the line/p'
> 
> is much much more approirate than 
> 
>     $ grep -e "requred match" |
>       extract desired part of the line
> 
> "grep" is better only if the original were
> 
>     $ sed -n -e '/required match/p'
> 
> but everybody would write it with grep to begin with ;-)

This was what I was intending. It was originally written like the above
and it made sense to convert it to use grep. I guess "filter lines" in
my commit message is a little bit vague. Could we change this to "filter
matching lines" perhaps?

> 
> So, I dunno about this part of the conversion.
> 
> > Instead of verifying each file of `alt_objects/pack/*.idx` individually
> > in a for-loop, batch them together into one verification step.
> 
> Do you mean this one?
> 
> 	git verify-pack -v alt_objects/pack/*.idx
> 
> where we may pass 1 or more .idx file to the command?  At first my
> reading was interrupted by a "Huh?", but that does look good.  We'd
> need to be a bit careful to make sure that we have at least 1 .idx
> file, as the shell will happily feed a file whose name is "*.idx",
> which verify-pack would be unhappy about.
> 
> > The original testing construct was O(n^2): it used a grep in a loop to
> > test whether any objects were missing in the packfile. Rewrite this to
> > sort the files then use `comm -23` so that finding missing lines from
> > the original file is done more efficiently.
> 
> OK.  If we an show measurable speedups, it would be great, but the
> loop structure does look O(n^2) and unnecessary costly.
> 
> > +test_no_missing_in_packs () {
> > +	myidx=$(ls -1 .git/objects/pack/*.idx) &&
> > +	test_path_is_file "$myidx" &&
> 
> If there are 2 or more .idx files, or if there is none, $myidx would
> hopefully be a concatenation of these filenames or a string that
> ends with asterisk-dot-idx and would fail path_is_file.  Sounds OK.
> 
> Ah, I do not have to review this part---these are repeated patterns
> in the original.
> 
> > +	git verify-pack -v alt_objects/pack/*.idx >orig.raw &&
> > +	grep "^[0-9a-f]\{40\}" orig.raw | cut -d" " -f1 | sort >orig &&
> 
> If output from 'grep' can be used as-is, it is worth doing, but if
> you have to pipe it to cut, the original that used sed to filter and
> edit the line would probably be a better way to write it.

The original sed actually only filtered the line; no editing done. The
cut invocation was a consequence of using comm. Previously, in the while
loop, we would separate the line into `sha1` and `rest` components and
only match using the `sha1`. Since we use comm now, we have to use cut
to grab the sha1 and omit the rest of the line.

We could rewrite it with sed like this:

	sed -n -e "/^[0-9a-f]\{40\}/s/^\($[0-9a-f]\{40\}\).*/\1/" orig.raw

but I believe that breaking it into grep and cut makes the intent much
more clear. 

What do you think?

(By the way, if I were to reroll this series, should I keep sending out
the entire patchset? It feels very noisy to send out 20-something emails
every reroll when I'm just making a small one or two line change.)

Thanks,

Denton

> 
> > +	git verify-pack -v $myidx >dest.raw &&
> 
> This part does not quote $myidx" (inherited from the original); it
> probably is OK, as any potentially problematic value in $myidx would
> have been caught as an error much earlier in this test.
>
Junio C Hamano Dec. 2, 2019, 10:53 p.m. UTC | #3
Denton Liu <liu.denton@gmail.com> writes:

> Hi Junio,
> ...
>> "grep" is better only if the original were
>> 
>>     $ sed -n -e '/required match/p'
>> 
>> but everybody would write it with grep to begin with ;-)
>
> This was what I was intending. It was originally written like the above
> and it made sense to convert it to use grep. I guess "filter lines" in
> my commit message is a little bit vague. Could we change this to "filter
> matching lines" perhaps?

Ah, I see.  I somehow thought that some of the "sed" invocation in
the original version were doing "find lines and filter its
contents" (i.e. "-n -e 's/find/munge/p'"), but all three of them are
just "find lines" (i.e. "-n -e '/find/p'").  So I think the change
made by the patch is OK.

I think I was reacting to the output of "grep" being piped to "cut".
IOW, the original

	... | sed -n -e '/find/p' |
	while read sha1 garbage
	do
		... use sha1 ...

were rewritten to

	... >raw &&
	grep -e 'find' raw | cut -d" " -f1 >orig
	... use orig as a list of sha1s ...

But the "grep piped to cut" can be a single process

	... >raw &&
	sed -n -e 's/\(find\).*/\1/p' raw >orig
	... use orig as a list of sha1s ...

So in the tiniest picture, turning "sed -n -e /find/p" into "grep"
is not wrong per-se, but if you step back a bit and see a larger
picture, using "sed" a bit more effectively turns out to be still a
better rewrite.

... and I wrote the above before I read the remainder of your
response, where you considered which one is easier to read between
"grep piped to cut" and "sed" ;-)


> (By the way, if I were to reroll this series, should I keep sending out
> the entire patchset? It feels very noisy to send out 20-something emails
> every reroll when I'm just making a small one or two line change.)

Especially if it is near the end of the series, just a single step
is OK.  But is there anything that is glaringly wrong that needs a
reroll?  Or would it be "this is good enough, so let's have them
cook in 'next' and graduate to 'master'---further clean-up can be
done after all the dust settles"?  I have an impression that we
reached the latter by now.

Thanks.
Denton Liu Dec. 2, 2019, 11:28 p.m. UTC | #4
Hi Junio,

On Mon, Dec 02, 2019 at 02:53:09PM -0800, Junio C Hamano wrote:
> > (By the way, if I were to reroll this series, should I keep sending out
> > the entire patchset? It feels very noisy to send out 20-something emails
> > every reroll when I'm just making a small one or two line change.)
> 
> Especially if it is near the end of the series, just a single step
> is OK.  But is there anything that is glaringly wrong that needs a
> reroll?  Or would it be "this is good enough, so let's have them
> cook in 'next' and graduate to 'master'---further clean-up can be
> done after all the dust settles"?  I have an impression that we
> reached the latter by now.

Perhaps the log message could use some improvement to document the
discussion we had? I don't know if that's worth a reroll, though. Aside
from that, I agree that it's ready for 'next'.

> 
> Thanks.
Junio C Hamano Dec. 3, 2019, 3:41 p.m. UTC | #5
Denton Liu <liu.denton@gmail.com> writes:

>> Especially if it is near the end of the series, just a single step
>> is OK.  But is there anything that is glaringly wrong that needs a
>> reroll?  Or would it be "this is good enough, so let's have them
>> cook in 'next' and graduate to 'master'---further clean-up can be
>> done after all the dust settles"?  I have an impression that we
>> reached the latter by now.
>
> Perhaps the log message could use some improvement to document the
> discussion we had? I don't know if that's worth a reroll, though. Aside
> from that, I agree that it's ready for 'next'.

Sure, let's see what you have in mind.

Thanks for working on this.
Denton Liu Dec. 4, 2019, 7:24 a.m. UTC | #6
Hi Junio,

On Tue, Dec 03, 2019 at 07:41:19AM -0800, Junio C Hamano wrote:
> Denton Liu <liu.denton@gmail.com> writes:
> 
> >> Especially if it is near the end of the series, just a single step
> >> is OK.  But is there anything that is glaringly wrong that needs a
> >> reroll?  Or would it be "this is good enough, so let's have them
> >> cook in 'next' and graduate to 'master'---further clean-up can be
> >> done after all the dust settles"?  I have an impression that we
> >> reached the latter by now.
> >
> > Perhaps the log message could use some improvement to document the
> > discussion we had? I don't know if that's worth a reroll, though. Aside
> > from that, I agree that it's ready for 'next'.
> 
> Sure, let's see what you have in mind.

Here's a complete replacement for the commit message:

	t7700: consolidate code into test_no_missing_in_packs()

	The code to test that objects were not missing from the packfile was
	duplicated many times. Extract the duplicated code into
	test_no_missing_in_packs() and use that instead.

	Refactor the resulting extraction so that if any git commands fail,
	their return codes are not silently lost.

	We were using sed to filter lines. Although not incorrect, this is
	exactly what grep is built for. Replace this invocation of sed with grep
	so that we use the correct tool for the job.

	Instead of verifying each file of `alt_objects/pack/*.idx` individually
	in a for-loop, batch them together into one verification step.

	The original testing construct was O(n^2): it used a grep in a loop to
	test whether any objects were missing in the packfile. Rewrite this to
	sort the files then use `comm -23` so that finding missing lines from
	the original file is done more efficiently.

	The result of this is that we end up with a `grep | cut | sort`
	pipeline. Previously, we were extracting the `sha1` as part of the
	`while read sha1 rest` loop. Since we removed the while-loop, we need to
	use `cut` to extract the `sha1` field. Note that we could have chosen to
	combine the `grep | cut` into a single `sed` invocation but we
	consciously leave it separate as it makes the intent more clear.

	While we're at it, add a space to `commit_and_pack ()` for style.

	Signed-off-by: Denton Liu <liu.denton@gmail.com>

The only change between this and the old commit message is the addition
of the "The result of this..." paragraph.

Thanks,

Denton

> 
> Thanks for working on this.
Junio C Hamano Dec. 4, 2019, 6:13 p.m. UTC | #7
Denton Liu <liu.denton@gmail.com> writes:

>> Sure, let's see what you have in mind.
>
> Here's a complete replacement for the commit message:
>
> 	t7700: consolidate code into test_no_missing_in_packs()
>
> 	The code to test that objects were not missing from the packfile was
> 	duplicated many times. Extract the duplicated code into
> 	test_no_missing_in_packs() and use that instead.
>
> 	Refactor the resulting extraction so that if any git commands fail,
> 	their return codes are not silently lost.
>
> 	We were using sed to filter lines. Although not incorrect, this is
> 	exactly what grep is built for. Replace this invocation of sed with grep
> 	so that we use the correct tool for the job.
>
> 	Instead of verifying each file of `alt_objects/pack/*.idx` individually
> 	in a for-loop, batch them together into one verification step.
>
> 	The original testing construct was O(n^2): it used a grep in a loop to
> 	test whether any objects were missing in the packfile. Rewrite this to
> 	sort the files then use `comm -23` so that finding missing lines from
> 	the original file is done more efficiently.
>
> 	The result of this is that we end up with a `grep | cut | sort`
> 	pipeline. Previously, we were extracting the `sha1` as part of the
> 	`while read sha1 rest` loop. Since we removed the while-loop, we need to
> 	use `cut` to extract the `sha1` field. Note that we could have chosen to
> 	combine the `grep | cut` into a single `sed` invocation but we
> 	consciously leave it separate as it makes the intent more clear.
>
> 	While we're at it, add a space to `commit_and_pack ()` for style.
>
> 	Signed-off-by: Denton Liu <liu.denton@gmail.com>
>
> The only change between this and the old commit message is the addition
> of the "The result of this..." paragraph.

Ah, you were planning to only update the log message?  Then let's
not bother.  I do not think we would want to encourage "grep piped
to cut" as a good pattern for others to follow and a single sed that
finds the relevant lines and munges the content of these lines into
what is desired conveys the intent clearly and more concisely (I
was hoping that that was what you had in mind for a reroll).
diff mbox series

Patch

diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 1d14ddcbdb..4bcd9fcc80 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -4,12 +4,23 @@  test_description='git repack works correctly'
 
 . ./test-lib.sh
 
-commit_and_pack() {
+commit_and_pack () {
 	test_commit "$@" 1>&2 &&
 	SHA1=$(git pack-objects --all --unpacked --incremental .git/objects/pack/pack </dev/null) &&
 	echo pack-${SHA1}.pack
 }
 
+test_no_missing_in_packs () {
+	myidx=$(ls -1 .git/objects/pack/*.idx) &&
+	test_path_is_file "$myidx" &&
+	git verify-pack -v alt_objects/pack/*.idx >orig.raw &&
+	grep "^[0-9a-f]\{40\}" orig.raw | cut -d" " -f1 | sort >orig &&
+	git verify-pack -v $myidx >dest.raw &&
+	cut -d" " -f1 dest.raw | sort >dest &&
+	comm -23 orig dest >missing &&
+	test_must_be_empty missing
+}
+
 test_expect_success 'objects in packs marked .keep are not repacked' '
 	echo content1 >file1 &&
 	echo content2 >file2 &&
@@ -105,19 +116,7 @@  test_expect_success 'packed obs in alt ODB are repacked even when local repo is
 	mkdir alt_objects/pack &&
 	mv .git/objects/pack/* alt_objects/pack &&
 	git repack -a &&
-	myidx=$(ls -1 .git/objects/pack/*.idx) &&
-	test_path_is_file "$myidx" &&
-	for p in alt_objects/pack/*.idx
-	do
-		git verify-pack -v $p | sed -n -e "/^[0-9a-f]\{40\}/p"
-	done | while read sha1 rest
-	do
-		if ! ( git verify-pack -v $myidx | grep "^$sha1" )
-		then
-			echo "Missing object in local pack: $sha1"
-			return 1
-		fi
-	done
+	test_no_missing_in_packs
 '
 
 test_expect_success 'packed obs in alt ODB are repacked when local repo has packs' '
@@ -128,19 +127,7 @@  test_expect_success 'packed obs in alt ODB are repacked when local repo has pack
 	git commit -m more_content &&
 	git repack &&
 	git repack -a -d &&
-	myidx=$(ls -1 .git/objects/pack/*.idx) &&
-	test_path_is_file "$myidx" &&
-	for p in alt_objects/pack/*.idx
-	do
-		git verify-pack -v $p | sed -n -e "/^[0-9a-f]\{40\}/p"
-	done | while read sha1 rest
-	do
-		if ! ( git verify-pack -v $myidx | grep "^$sha1" )
-		then
-			echo "Missing object in local pack: $sha1"
-			return 1
-		fi
-	done
+	test_no_missing_in_packs
 '
 
 test_expect_success 'packed obs in alternate ODB kept pack are repacked' '
@@ -156,19 +143,7 @@  test_expect_success 'packed obs in alternate ODB kept pack are repacked' '
 		fi
 	done &&
 	git repack -a -d &&
-	myidx=$(ls -1 .git/objects/pack/*.idx) &&
-	test_path_is_file "$myidx" &&
-	for p in alt_objects/pack/*.idx
-	do
-		git verify-pack -v $p | sed -n -e "/^[0-9a-f]\{40\}/p"
-	done | while read sha1 rest
-	do
-		if ! ( git verify-pack -v $myidx | grep "^$sha1" )
-		then
-			echo "Missing object in local pack: $sha1"
-			return 1
-		fi
-	done
+	test_no_missing_in_packs
 '
 
 test_expect_success 'packed unreachable obs in alternate ODB are not loosened' '