diff mbox series

[v4,8/8] clean/smudge: allow clean filters to process extremely large files

Message ID 41fda423982d99847d3879f5ea1eb3570ae9eab6.1635867971.git.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit 596b5e77c960cc57ad2e68407b298411ec5e8cb8
Headers show
Series Allow clean/smudge filters to handle huge files in the LLP64 data model | expand

Commit Message

Matt Cooper Nov. 2, 2021, 3:46 p.m. UTC
From: Matt Cooper <vtbassmatt@gmail.com>

The filter system allows for alterations to file contents when they're
moved between the database and the worktree. We already made sure that
it is possible for smudge filters to produce contents that are larger
than `unsigned long` can represent (which matters on systems where
`unsigned long` is narrower than `size_t`, most notably 64-bit Windows).
Now we make sure that clean filters can _consume_ contents that are
larger than that.

Note that this commit only allows clean filters' _input_ to be larger
than can be represented by `unsigned long`.

This change makes only a very minute dent into the much larger project
to teach Git to use `size_t` instead of `unsigned long` wherever
appropriate.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 convert.c                   |  2 +-
 t/t1051-large-conversion.sh | 11 +++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

Comments

Torsten Bögershausen Nov. 2, 2021, 8:47 p.m. UTC | #1
On Tue, Nov 02, 2021 at 03:46:11PM +0000, Matt Cooper via GitGitGadget wrote:
> From: Matt Cooper <vtbassmatt@gmail.com>
>
> The filter system allows for alterations to file contents when they're

Some nit-picking:
looking at
https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes
we can read
"...substitutions in files on commit/checkout."

Should we use this wording here as well ?


> moved between the database and the worktree. We already made sure that
> it is possible for smudge filters to produce contents that are larger
> than `unsigned long` can represent (which matters on systems where
> `unsigned long` is narrower than `size_t`, most notably 64-bit Windows).
> Now we make sure that clean filters can _consume_ contents that are
> larger than that.
>
> Note that this commit only allows clean filters' _input_ to be larger
> than can be represented by `unsigned long`.
>
> This change makes only a very minute dent into the much larger project
> to teach Git to use `size_t` instead of `unsigned long` wherever
> appropriate.
>
> Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  convert.c                   |  2 +-
>  t/t1051-large-conversion.sh | 11 +++++++++++
>  2 files changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/convert.c b/convert.c
> index fd9c84b0257..5ad6dfc08a0 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -613,7 +613,7 @@ static int crlf_to_worktree(const char *src, size_t len, struct strbuf *buf,
>
>  struct filter_params {
>  	const char *src;
> -	unsigned long size;
> +	size_t size;
>  	int fd;
>  	const char *cmd;
>  	const char *path;
> diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
> index e6d52f98b15..042b0e44292 100755
> --- a/t/t1051-large-conversion.sh
> +++ b/t/t1051-large-conversion.sh
> @@ -98,4 +98,15 @@ test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
>  	test "$size" -eq $((5 * 1024 * 1024 * 1024 + $small_size))
>  '
>
> +# This clean filter writes down the size of input it receives. By checking against
> +# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
> +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> +		'files over 4GB convert on input' '
> +	test-tool genzeros $((5*1024*1024*1024)) >big &&
> +	test_config filter.checklarge.clean "wc -c >big.size" &&
> +	echo "big filter=checklarge" >.gitattributes &&
> +	git add big &&
> +	test $(test_file_size big) -eq $(cat big.size)
> +'
> +
>  test_done
> --
> gitgitgadget
Johannes Schindelin Nov. 4, 2021, 12:11 a.m. UTC | #2
Hi Torsten,

On Tue, 2 Nov 2021, Torsten Bögershausen wrote:

> On Tue, Nov 02, 2021 at 03:46:11PM +0000, Matt Cooper via GitGitGadget wrote:
> > From: Matt Cooper <vtbassmatt@gmail.com>
> >
> > The filter system allows for alterations to file contents when they're
>
> Some nit-picking:
> looking at
> https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes
> we can read
> "...substitutions in files on commit/checkout."
>
> Should we use this wording here as well ?

Again, I believe that Matt's command of the English language is pretty
good (but then, I have the advantage of knowing him and I very much enjoy
learning new English words while chatting with him). I would therefore
chalk it up to artistic license when he uses the word "alterations".

Since you did not comment on the patch, may I assume that you find it
flawless?

Ciao,
Dscho
Torsten Bögershausen Nov. 4, 2021, 8:33 a.m. UTC | #3
On Thu, Nov 04, 2021 at 01:11:44AM +0100, Johannes Schindelin wrote:
> Hi Torsten,
>
> On Tue, 2 Nov 2021, Torsten Bögershausen wrote:
>
> > On Tue, Nov 02, 2021 at 03:46:11PM +0000, Matt Cooper via GitGitGadget wrote:
> > > From: Matt Cooper <vtbassmatt@gmail.com>
> > >
> > > The filter system allows for alterations to file contents when they're
> >
> > Some nit-picking:
> > looking at
> > https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes
> > we can read
> > "...substitutions in files on commit/checkout."
> >
> > Should we use this wording here as well ?
>
> Again, I believe that Matt's command of the English language is pretty
> good (but then, I have the advantage of knowing him and I very much enjoy
> learning new English words while chatting with him). I would therefore
> chalk it up to artistic license when he uses the word "alterations".

That was not really what my comment was about.
We have exising documentations about Git at other places, and my question
was if we can/should/could use the same terminolgy here in the commit message
as well.
This could make it easier for readers, if the same words are used for the same
thing.

>
> Since you did not comment on the patch, may I assume that you find it
> flawless?

I could not find any flaws.

>
> Ciao,
> Dscho
Junio C Hamano Nov. 4, 2021, 5:26 p.m. UTC | #4
"Matt Cooper via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  struct filter_params {
>  	const char *src;
> -	unsigned long size;
> +	size_t size;
>  	int fd;
>  	const char *cmd;
>  	const char *path;

OK, this member is used in only two places in the file.  One is used
as a parameter to write_in_full() that is prepared to take size_t.
The other is to assign to this member from a size_t parameter the
apply_single_file_filter() function got from the caller, and the
callchain leading down to the function are size_t aware.

So, this may only make "a small dent" as described in the proposed
log message, but it does move it in the right direction.

Thanks.
diff mbox series

Patch

diff --git a/convert.c b/convert.c
index fd9c84b0257..5ad6dfc08a0 100644
--- a/convert.c
+++ b/convert.c
@@ -613,7 +613,7 @@  static int crlf_to_worktree(const char *src, size_t len, struct strbuf *buf,
 
 struct filter_params {
 	const char *src;
-	unsigned long size;
+	size_t size;
 	int fd;
 	const char *cmd;
 	const char *path;
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index e6d52f98b15..042b0e44292 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -98,4 +98,15 @@  test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
 	test "$size" -eq $((5 * 1024 * 1024 * 1024 + $small_size))
 '
 
+# This clean filter writes down the size of input it receives. By checking against
+# the actual size, we ensure that cleaning doesn't mangle large files on 64-bit Windows.
+test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+		'files over 4GB convert on input' '
+	test-tool genzeros $((5*1024*1024*1024)) >big &&
+	test_config filter.checklarge.clean "wc -c >big.size" &&
+	echo "big filter=checklarge" >.gitattributes &&
+	git add big &&
+	test $(test_file_size big) -eq $(cat big.size)
+'
+
 test_done