[v4,5/8] odb: teach read_blob_entry to use size_t

Message ID	308a8f2a3ade63ef21feb945e45866f2a83ae101.1635867971.git.gitgitgadget@gmail.com (mailing list archive)
State	Accepted
Commit	e9aa762cc72e6cf8fd76fefe5ca2b5064be1a821
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <308a8f2a3ade63ef21feb945e45866f2a83ae101.1635867971.git.gitgitgadget@gmail.com> In-Reply-To: <pull.1068.v4.git.1635867971.gitgitgadget@gmail.com> References: <pull.1068.v3.git.1635515959.gitgitgadget@gmail.com> <pull.1068.v4.git.1635867971.gitgitgadget@gmail.com> Date: Tue, 02 Nov 2021 15:46:08 +0000 Subject: [PATCH v4 5/8] odb: teach read_blob_entry to use size_t Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Carlo Arenas <carenas@gmail.com>, "brian m. carlson" <sandals@crustytoothpaste.net>, Johannes Schindelin <johannes.schindelin@gmail.com>, Philip Oakley <philipoakley@iee.email>, Torsten =?utf-8?q?B=C3=B6gershausen?= <tboegi@web.de>, Johannes Schindelin <johannes.schindelin@gmx.de>, Matt Cooper <vtbassmatt@gmail.com> Precedence: bulk From: Matt Cooper <vtbassmatt@gmail.com>
Series	Allow clean/smudge filters to handle huge files in the LLP64 data model \| expand [v4,0/8] Allow clean/smudge filters to handle huge files in the LLP64 data model [v4,1/8] test-genzeros: allow more than 2G zeros in Windows [v4,2/8] test-tool genzeros: generate large amounts of data more efficiently [v4,3/8] test-lib: add prerequisite for 64-bit platforms [v4,4/8] t1051: introduce a smudge filter test for extremely large files [v4,5/8] odb: teach read_blob_entry to use size_t [v4,6/8] git-compat-util: introduce more size_t helpers [v4,7/8] odb: guard against data loss checking out a huge file [v4,8/8] clean/smudge: allow clean filters to process extremely large files

Message ID

308a8f2a3ade63ef21feb945e45866f2a83ae101.1635867971.git.gitgitgadget@gmail.com (mailing list archive)

State

Accepted

Commit

e9aa762cc72e6cf8fd76fefe5ca2b5064be1a821

Headers

Message-Id: 
 <308a8f2a3ade63ef21feb945e45866f2a83ae101.1635867971.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.1068.v4.git.1635867971.gitgitgadget@gmail.com>
References: <pull.1068.v3.git.1635515959.gitgitgadget@gmail.com>
        <pull.1068.v4.git.1635867971.gitgitgadget@gmail.com>
Date: Tue, 02 Nov 2021 15:46:08 +0000
Subject: [PATCH v4 5/8] odb: teach read_blob_entry to use size_t
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: Carlo Arenas <carenas@gmail.com>,
 "brian m. carlson" <sandals@crustytoothpaste.net>,
 Johannes Schindelin <johannes.schindelin@gmail.com>,
 Philip Oakley <philipoakley@iee.email>,
 Torsten =?utf-8?q?B=C3=B6gershausen?= <tboegi@web.de>,
 Johannes Schindelin <johannes.schindelin@gmx.de>,
 Matt Cooper <vtbassmatt@gmail.com>
Precedence: bulk
From: Matt Cooper <vtbassmatt@gmail.com>

Series

Allow clean/smudge filters to handle huge files in the LLP64 data model | expand

Commit Message

Matt Cooper Nov. 2, 2021, 3:46 p.m. UTC

From: Matt Cooper <vtbassmatt@gmail.com>

There is mixed use of size_t and unsigned long to deal with sizes in the
codebase. Recall that Windows defines unsigned long as 32 bits even on
64-bit platforms, meaning that converting size_t to unsigned long narrows
the range. This mostly doesn't cause a problem since Git rarely deals
with files larger than 2^32 bytes.

But adjunct systems such as Git LFS, which use smudge/clean filters to
keep huge files out of the repository, may have huge file contents passed
through some of the functions in entry.c and convert.c. On Windows, this
results in a truncated file being written to the workdir. I traced this to
one specific use of unsigned long in write_entry (and a similar instance
in write_pc_item_to_fd for parallel checkout). That appeared to be for
the call to read_blob_entry, which expects a pointer to unsigned long.

By altering the signature of read_blob_entry to expect a size_t,
write_entry can be switched to use size_t internally (which all of its
callers and most of its callees already used). To avoid touching dozens of
additional files, read_blob_entry uses a local unsigned long to call a
chain of functions which aren't prepared to accept size_t.

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
 entry.c                     | 8 +++++---
 entry.h                     | 2 +-
 parallel-checkout.c         | 2 +-
 t/t1051-large-conversion.sh | 2 +-
 4 files changed, 8 insertions(+), 6 deletions(-)

Comments

Torsten Bögershausen Nov. 2, 2021, 8:40 p.m. UTC | #1

Nicely explained, some comments inline

On Tue, Nov 02, 2021 at 03:46:08PM +0000, Matt Cooper via GitGitGadget wrote:
> From: Matt Cooper <vtbassmatt@gmail.com>
>
> There is mixed use of size_t and unsigned long to deal with sizes in the
> codebase. Recall that Windows defines unsigned long as 32 bits even on
> 64-bit platforms, meaning that converting size_t to unsigned long narrows
> the range. This mostly doesn't cause a problem since Git rarely deals
> with files larger than 2^32 bytes.

What does this mean ?
> ... This mostly doesn't cause a problem since Git rarely deals
> with files larger than 2^32 bytes.

Is "mostly" is a good wording here ?
May be
This doesn't cause a problem when files smaller than 2^32 bytes are handled by Git.

>
> But adjunct systems such as Git LFS, which use smudge/clean filters to
> keep huge files out of the repository, may have huge file contents passed
> through some of the functions in entry.c and convert.c. On Windows, this
> results in a truncated file being written to the workdir. I traced this to
> one specific use of unsigned long in write_entry (and a similar instance
> in write_pc_item_to_fd for parallel checkout). That appeared to be for
> the call to read_blob_entry, which expects a pointer to unsigned long.
>
> By altering the signature of read_blob_entry to expect a size_t,

"expect" -> "use"

(I am not a native English speaker, would "changing" be better than "altering" ?)
 By changing the signature of read_blob_entry to use size_t,



> write_entry can be switched to use size_t internally (which all of its
> callers and most of its callees already used). To avoid touching dozens of
> additional files, read_blob_entry uses a local unsigned long to call a
> chain of functions which aren't prepared to accept size_t.
>
> Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> Signed-off-by: Matt Cooper <vtbassmatt@gmail.com>
> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
> ---
>  entry.c                     | 8 +++++---
>  entry.h                     | 2 +-
>  parallel-checkout.c         | 2 +-
>  t/t1051-large-conversion.sh | 2 +-
>  4 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/entry.c b/entry.c
> index 711ee0693c7..4cb3942dbdc 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -82,11 +82,13 @@ static int create_file(const char *path, unsigned int mode)
>  	return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
>  }
>
> -void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
> +void *read_blob_entry(const struct cache_entry *ce, size_t *size)
>  {
>  	enum object_type type;
> -	void *blob_data = read_object_file(&ce->oid, &type, size);
> +	unsigned long ul;
> +	void *blob_data = read_object_file(&ce->oid, &type, &ul);
>
> +	*size = ul;
>  	if (blob_data) {
>  		if (type == OBJ_BLOB)
>  			return blob_data;
> @@ -270,7 +272,7 @@ static int write_entry(struct cache_entry *ce, char *path, struct conv_attrs *ca
>  	int fd, ret, fstat_done = 0;
>  	char *new_blob;
>  	struct strbuf buf = STRBUF_INIT;
> -	unsigned long size;
> +	size_t size;
>  	ssize_t wrote;
>  	size_t newsize = 0;
>  	struct stat st;
> diff --git a/entry.h b/entry.h
> index b8c0e170dc7..61ee8c17604 100644
> --- a/entry.h
> +++ b/entry.h
> @@ -51,7 +51,7 @@ int finish_delayed_checkout(struct checkout *state, int *nr_checkouts);
>   */
>  void unlink_entry(const struct cache_entry *ce);
>
> -void *read_blob_entry(const struct cache_entry *ce, unsigned long *size);
> +void *read_blob_entry(const struct cache_entry *ce, size_t *size);
>  int fstat_checkout_output(int fd, const struct checkout *state, struct stat *st);
>  void update_ce_after_write(const struct checkout *state, struct cache_entry *ce,
>  			   struct stat *st);
> diff --git a/parallel-checkout.c b/parallel-checkout.c
> index 6b1af32bb3d..b6f4a25642e 100644
> --- a/parallel-checkout.c
> +++ b/parallel-checkout.c
> @@ -261,7 +261,7 @@ static int write_pc_item_to_fd(struct parallel_checkout_item *pc_item, int fd,
>  	struct stream_filter *filter;
>  	struct strbuf buf = STRBUF_INIT;
>  	char *blob;
> -	unsigned long size;
> +	size_t size;
>  	ssize_t wrote;
>
>  	/* Sanity check */
> diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
> index e7f9f0bdc56..e6d52f98b15 100755
> --- a/t/t1051-large-conversion.sh
> +++ b/t/t1051-large-conversion.sh
> @@ -85,7 +85,7 @@ test_expect_success 'ident converts on output' '
>
>  # This smudge filter prepends 5GB of zeros to the file it checks out. This
>  # ensures that smudging doesn't mangle large files on 64-bit Windows.
> -test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
> +test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
>  		'files over 4GB convert on output' '
>  	test_commit test small "a small file" &&
>  	small_size=$(test_file_size small) &&
> --
> gitgitgadget
>

Johannes Schindelin Nov. 4, 2021, 12:09 a.m. UTC | #2

Hi Torsten,

On Tue, 2 Nov 2021, Torsten Bögershausen wrote:

> On Tue, Nov 02, 2021 at 03:46:08PM +0000, Matt Cooper via GitGitGadget wrote:
> > From: Matt Cooper <vtbassmatt@gmail.com>
> >
> > There is mixed use of size_t and unsigned long to deal with sizes in the
> > codebase. Recall that Windows defines unsigned long as 32 bits even on
> > 64-bit platforms, meaning that converting size_t to unsigned long narrows
> > the range. This mostly doesn't cause a problem since Git rarely deals
> > with files larger than 2^32 bytes.
>
> What does this mean ?

I found the explanation to be quite clear (otherwise I would have changed
it before submitting, obviously).

Git is rarely used with large files, as it was meant to track source code
files, and it is pretty rare that a source code file is larger than 4GB.
Therefore, the described issues are edge cases rather than common ones.

> > ... This mostly doesn't cause a problem since Git rarely deals
> > with files larger than 2^32 bytes.
>
> Is "mostly" is a good wording here ?

I do think so.

> May be

s/May be/Maybe/ since you're nitpicking wording ;-)

> This doesn't cause a problem when files smaller than 2^32 bytes are handled by Git.

That would lose the rather important fact that it is common to encounter
only tracked files that are much smaller than 4GB.

> > But adjunct systems such as Git LFS, which use smudge/clean filters to
> > keep huge files out of the repository, may have huge file contents passed
> > through some of the functions in entry.c and convert.c. On Windows, this
> > results in a truncated file being written to the workdir. I traced this to
> > one specific use of unsigned long in write_entry (and a similar instance
> > in write_pc_item_to_fd for parallel checkout). That appeared to be for
> > the call to read_blob_entry, which expects a pointer to unsigned long.
> >
> > By altering the signature of read_blob_entry to expect a size_t,
>
> "expect" -> "use"

I would say that in the context of talking about a signature, "expect" is
a better verb than "use".

But then, just like you I am not a native speaker, so I think we should
maybe stop telling a native speaker like Matt how to use his native
tongue...

> (I am not a native English speaker, would "changing" be better than "altering" ?)
>  By changing the signature of read_blob_entry to use size_t,

As I said, I am not a native English speaker, either. So I believe that
Matt knows better than the two of us together how to phrase things in
English.

Since you had nothing to say about the patch itself, may I assume that
you're fine with it?

Ciao,
Dscho

Philip Oakley Nov. 4, 2021, 12:24 p.m. UTC | #3

On 04/11/2021 00:09, Johannes Schindelin wrote:
> I would say that in the context of talking about a signature, "expect" is
> a better verb than "use".
>
> But then, just like you I am not a native speaker, so I think we should
> maybe stop telling a native speaker like Matt how to use his native
> tongue...
>
>> (I am not a native English speaker, would "changing" be better than "altering" ?)
>>  By changing the signature of read_blob_entry to use size_t,
> As I said, I am not a native English speaker, either. So I believe that
> Matt knows better than the two of us together how to phrase things in
> English.
Being a native speaker can be a bit of a Catch 22, especially in
English, as we can use unusual idioms and find them normal, and often
weren't taught 'grammar'.

It is important that the text is understandable to those for whom
English isn't their first language. At least we don't have to use
simplified English [e.g. 1,2]

Philip

[1] https://en.wikipedia.org/wiki/Simplified_Technical_English
[2]
https://www.boeing.com/company/key-orgs/licensing/simplified-english-checker.page
(not actually used;-)

diff --git a/entry.c b/entry.c
index 711ee0693c7..4cb3942dbdc 100644
--- a/entry.c
+++ b/entry.c
@@ -82,11 +82,13 @@  static int create_file(const char *path, unsigned int mode)
 	return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
 }
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size)
+void *read_blob_entry(const struct cache_entry *ce, size_t *size)
 {
 	enum object_type type;
-	void *blob_data = read_object_file(&ce->oid, &type, size);
+	unsigned long ul;
+	void *blob_data = read_object_file(&ce->oid, &type, &ul);
 
+	*size = ul;
 	if (blob_data) {
 		if (type == OBJ_BLOB)
 			return blob_data;
@@ -270,7 +272,7 @@  static int write_entry(struct cache_entry *ce, char *path, struct conv_attrs *ca
 	int fd, ret, fstat_done = 0;
 	char *new_blob;
 	struct strbuf buf = STRBUF_INIT;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 	size_t newsize = 0;
 	struct stat st;
diff --git a/entry.h b/entry.h
index b8c0e170dc7..61ee8c17604 100644
--- a/entry.h
+++ b/entry.h
@@ -51,7 +51,7 @@  int finish_delayed_checkout(struct checkout *state, int *nr_checkouts);
  */
 void unlink_entry(const struct cache_entry *ce);
 
-void *read_blob_entry(const struct cache_entry *ce, unsigned long *size);
+void *read_blob_entry(const struct cache_entry *ce, size_t *size);
 int fstat_checkout_output(int fd, const struct checkout *state, struct stat *st);
 void update_ce_after_write(const struct checkout *state, struct cache_entry *ce,
 			   struct stat *st);
diff --git a/parallel-checkout.c b/parallel-checkout.c
index 6b1af32bb3d..b6f4a25642e 100644
--- a/parallel-checkout.c
+++ b/parallel-checkout.c
@@ -261,7 +261,7 @@  static int write_pc_item_to_fd(struct parallel_checkout_item *pc_item, int fd,
 	struct stream_filter *filter;
 	struct strbuf buf = STRBUF_INIT;
 	char *blob;
-	unsigned long size;
+	size_t size;
 	ssize_t wrote;
 
 	/* Sanity check */
diff --git a/t/t1051-large-conversion.sh b/t/t1051-large-conversion.sh
index e7f9f0bdc56..e6d52f98b15 100755
--- a/t/t1051-large-conversion.sh
+++ b/t/t1051-large-conversion.sh
@@ -85,7 +85,7 @@  test_expect_success 'ident converts on output' '
 
 # This smudge filter prepends 5GB of zeros to the file it checks out. This
 # ensures that smudging doesn't mangle large files on 64-bit Windows.
-test_expect_failure EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
+test_expect_success EXPENSIVE,SIZE_T_IS_64BIT,!LONG_IS_64BIT \
 		'files over 4GB convert on output' '
 	test_commit test small "a small file" &&
 	small_size=$(test_file_size small) &&

[v4,5/8] odb: teach read_blob_entry to use size_t

Commit Message

Comments

Patch