diff mbox series

[v3,4/8] t1051: introduce a smudge filter test for extremely large files

Message ID ce9dfaac9f8693890aa402161b38292b31d3690c.1635515959.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Allow clean/smudge filters to handle huge files in the LLP64 data model | expand

Commit Message

Matt Cooper Oct. 29, 2021, 1:59 p.m. UTC
From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
added to the database or workdir. ("Smudge" when moving to the workdir;
"clean" when moving to the database.) This is used natively to handle CRLF
to LF conversions. It's also employed by Git-LFS to replace large files
from the workdir with small tracking files in the repo and vice versa.

Git pulls the entire smudged file into memory. While this is inefficient,
there's a more insidious problem on some platforms due to inconsistency
between using unsigned long and size_t for the same type of data (size of
a file in bytes). On most 64-bit platforms, unsigned long is 64 bits, and
size_t is typedef'd to unsigned long. On Windows, however, unsigned long is
only 32 bits (and therefore on 64-bit Windows, size_t is typedef'd to
unsigned long long in order to be 64 bits).

Practically speaking, this means 64-bit Windows users of Git-LFS can't
handle files larger than 2^32 bytes. Other 64-bit platforms don't suffer
this limitation.

This commit introduces a test exposing the issue; future commits make it
pass. The test simulates the way Git-LFS works by having a tiny file
checked into the repository and expanding it to a huge file on checkout.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 t/t1051-large-conversion.sh | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Junio C Hamano Oct. 29, 2021, 11 p.m. UTC | #1
"Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Matt Cooper <vtbassmatt@gmail.com>
>
> The filter system allows for alterations to file contents when they're
> added to the database or workdir. ("Smudge" when moving to the workdir;
> "clean" when moving to the database.) This is used natively to handle CRLF
> to LF conversions. It's also employed by Git-LFS to replace large files
> from the workdir with small tracking files in the repo and vice versa.

Not a huge deal, but make it a habit to spell "working tree" not "workdir",
as someday you'd write end-user facing documentation in our tree ;-).
 
> Git pulls the entire smudged file into memory.

Giving "for what" would be helpful to readers.

    Git reads the entire smudged file into memory to convert it into
    a "clean" form to be used in-core.

> While this is inefficient,
> there's a more insidious problem on some platforms due to inconsistency
> between using unsigned long and size_t for the same type of data (size of
> a file in bytes). On most 64-bit platforms, unsigned long is 64 bits, and
> size_t is typedef'd to unsigned long. On Windows, however, unsigned long is
> only 32 bits (and therefore on 64-bit Windows, size_t is typedef'd to
> unsigned long long in order to be 64 bits).
>
> Practically speaking, this means 64-bit Windows users of Git-LFS can't
> handle files larger than 2^32 bytes. Other 64-bit platforms don't suffer
> this limitation.
>
> This commit introduces a test exposing the issue; future commits make it
> pass. The test simulates the way Git-LFS works by having a tiny file
> checked into the repository and expanding it to a huge file on checkout.
>
> Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  t/t1051-large-conversion.sh | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
> index 8b7640b3ba8..bff86c13208 100755
> --- a/t/t1051-large-conversion.sh
> +++ b/t/t1051-large-conversion.sh
> @@ -83,4 +83,18 @@ test_expect_success 'ident converts on output' '
>  	test_cmp small.clean large.clean
>  '
>  
> +# This smudge filter prepends 5GB of zeros to the file it checks out. This
> +# ensures that smudging doesn't mangle large files on 64-bit Windows.
> +test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> +		'files over 4GB convert on output' '
> +	test_commit test small "a small file" &&
> +	test_config filter.makelarge.smudge \
> +		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
> +	echo "small filter=makelarge" >.gitattributes &&
> +	rm small &&
> +	git checkout -- small &&
> +	size=$(test_file_size small) &&
> +	test "$size" -ge $((5 * 1024 * 1024 * 1024))
> +'

Why not exactly 5G, but anything that is at least 5G is OK?

Thanks.

>  test_done
Junio C Hamano Oct. 29, 2021, 11:21 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

>> +# This smudge filter prepends 5GB of zeros to the file it checks out. This
>> +# ensures that smudging doesn't mangle large files on 64-bit Windows.
>> +test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
>> +		'files over 4GB convert on output' '
>> +	test_commit test small "a small file" &&
>> +	test_config filter.makelarge.smudge \
>> +		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
>> +	echo "small filter=makelarge" >.gitattributes &&
>> +	rm small &&
>> +	git checkout -- small &&
>> +	size=$(test_file_size small) &&
>> +	test "$size" -ge $((5 * 1024 * 1024 * 1024))
>> +'
>
> Why not exactly 5G, but anything that is at least 5G is OK?

I know it is more than 5G, thanks to the "&& cat".  THe question was
why aren't we measuring the size of "a small file" so that we can
check against an exact size to be expected.

Thanks.
Johannes Schindelin Nov. 2, 2021, 2:56 p.m. UTC | #3
Hi Junio,

On Fri, 29 Oct 2021, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
> >> +# This smudge filter prepends 5GB of zeros to the file it checks out. This
> >> +# ensures that smudging doesn't mangle large files on 64-bit Windows.
> >> +test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> >> +		'files over 4GB convert on output' '
> >> +	test_commit test small "a small file" &&
> >> +	test_config filter.makelarge.smudge \
> >> +		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
> >> +	echo "small filter=makelarge" >.gitattributes &&
> >> +	rm small &&
> >> +	git checkout -- small &&
> >> +	size=$(test_file_size small) &&
> >> +	test "$size" -ge $((5 * 1024 * 1024 * 1024))
> >> +'
> >
> > Why not exactly 5G, but anything that is at least 5G is OK?
>
> I know it is more than 5G, thanks to the "&& cat".  THe question was
> why aren't we measuring the size of "a small file" so that we can
> check against an exact size to be expected.

The main problem solved by this patch series is the fact that by virtue of
using `unsigned long`, Git truncated the contents to less than 4GB. So,
really, what we care about here is that that does not happen anymore.

I will change it to take the extra time to determine `small`'s file size
and add that to 5GB, to avoid confusing readers, though.

Ciao,
Dscho
Johannes Schindelin Nov. 2, 2021, 2:57 p.m. UTC | #4
Hi Junio,

On Fri, 29 Oct 2021, Junio C Hamano wrote:

> "Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Matt Cooper <vtbassmatt@gmail.com>
> >
> > The filter system allows for alterations to file contents when they're
> > added to the database or workdir. ("Smudge" when moving to the workdir;
> > "clean" when moving to the database.) This is used natively to handle CRLF
> > to LF conversions. It's also employed by Git-LFS to replace large files
> > from the workdir with small tracking files in the repo and vice versa.
>
> Not a huge deal, but make it a habit to spell "working tree" not "workdir",
> as someday you'd write end-user facing documentation in our tree ;-).
>
> > Git pulls the entire smudged file into memory.
>
> Giving "for what" would be helpful to readers.
>
>     Git reads the entire smudged file into memory to convert it into
>     a "clean" form to be used in-core.

I changed both.

Ciao,
Dscho
diff mbox series

Patch

diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index 8b7640b3ba8..bff86c13208 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -83,4 +83,18 @@  test_expect_success 'ident converts on output' '
 	test_cmp small.clean large.clean
 '
 
+# This smudge filter prepends 5GB of zeros to the file it checks out. This
+# ensures that smudging doesn't mangle large files on 64-bit Windows.
+test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+		'files over 4GB convert on output' '
+	test_commit test small "a small file" &&
+	test_config filter.makelarge.smudge \
+		"test-tool genzeros $((5*1024*1024*1024)) && cat" &&
+	echo "small filter=makelarge" >.gitattributes &&
+	rm small &&
+	git checkout -- small &&
+	size=$(test_file_size small) &&
+	test "$size" -ge $((5 * 1024 * 1024 * 1024))
+'
+
 test_done