diff mbox series

[v2,2/2] Prevent git from rehashing 4GiB files

Message ID 20231012160930.330618-3-sandals@crustytoothpaste.net (mailing list archive)
State Accepted
Commit 5143ac07b17e2b025865378fce24cc11ac7bf8b1
Headers show
Series Prevent re-reading 4 GiB files on every status | expand

Commit Message

brian m. carlson Oct. 12, 2023, 4:09 p.m. UTC
From: Jason Hatton <jhatton@globalfinishing.com>

The index stores file sizes using a uint32_t. This causes any file
that is a multiple of 2^32 to have a cached file size of zero.
Zero is a special value used by racily clean. This causes git to
rehash every file that is a multiple of 2^32 every time git status
or git commit is run.

This patch mitigates the problem by making all files that are a
multiple of 2^32 appear to have a size of 1<<31 instead of zero.

The value of 1<<31 is chosen to keep it as far away from zero
as possible to help prevent things getting mixed up with unpatched
versions of git.

An example would be to have a 2^32 sized file in the index of
patched git. Patched git would save the file as 2^31 in the cache.
An unpatched git would very much see the file has changed in size
and force it to rehash the file, which is safe. The file would
have to grow or shrink by exactly 2^31 and retain all of its
ctime, mtime, and other attributes for old git to not notice
the change.

This patch does not change the behavior of any file that is not
an exact multiple of 2^32.

Signed-off-by: Jason D. Hatton <jhatton@globalfinishing.com>
Signed-off-by: brian m. carlson <bk2204@github.com>
---
 statinfo.c        | 20 ++++++++++++++++++--
 t/t7508-status.sh | 16 ++++++++++++++++
 2 files changed, 34 insertions(+), 2 deletions(-)

Comments

Junio C Hamano Oct. 12, 2023, 5:58 p.m. UTC | #1
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> An example would be to have a 2^32 sized file in the index of
> patched git. Patched git would save the file as 2^31 in the cache.
> An unpatched git would very much see the file has changed in size
> and force it to rehash the file, which is safe.

The reason why this is "safe" is because an older Git will would
keep rehashing whether 2^31 or 0 is stored as its sd_size, so the
change is not making things worse?  With older git, "git diff-files"
will report that such a file is not up to date, and then the user
will refresh the index, which will store 0 as its sd_file, so
tentatively "git status" may give a wrong information, but we
probalby do not care?  Is that how the reasoning goes?

> +/*
> + * Munge st_size into an unsigned int.
> + */
> +static unsigned int munge_st_size(off_t st_size) {
> +	unsigned int sd_size = st_size;
> +
> +	/*
> +	 * If the file is an exact multiple of 4 GiB, modify the value so it
> +	 * doesn't get marked as racily clean (zero).
> +	 */
> +	if (!sd_size && st_size)
> +		return 0x80000000;
> +	else
> +		return sd_size;
> +}

This assumes typeof(sd_size) aka "unsigned int" is always 32-bit,
which does not sound reasonable.  Reference to 4GiB, 2^32 and 2^31
in the code and the proposed commit log message need to be qualified
with "on a platform whose uint is 32-bit" or something, or better
yet, phrased in a way that is agnostic to the integer size.  At
the very least, the hardcoded 0x80000000 needs to be rethought, I
am afraid.

Other than that, the workaround for the racy-git avoidance code does
sound good.  I actually wonder if we should always use 1 regardless
of integer size.
brian m. carlson Oct. 12, 2023, 9:58 p.m. UTC | #2
On 2023-10-12 at 17:58:42, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > An example would be to have a 2^32 sized file in the index of
> > patched git. Patched git would save the file as 2^31 in the cache.
> > An unpatched git would very much see the file has changed in size
> > and force it to rehash the file, which is safe.
> 
> The reason why this is "safe" is because an older Git will would
> keep rehashing whether 2^31 or 0 is stored as its sd_size, so the
> change is not making things worse?  With older git, "git diff-files"
> will report that such a file is not up to date, and then the user
> will refresh the index, which will store 0 as its sd_file, so
> tentatively "git status" may give a wrong information, but we
> probalby do not care?  Is that how the reasoning goes?

Correct.  An older Git will rehash it either way, on every git status.
If you run git diff-files on an older Git, it will always show it as
modified (hence why I used that in the tests).  The git status will
rehash it and then realize that it isn't modified, not printing any
changes, so the behaviour is not wrong, it's just extremely slow (SHA-1
DC on 4 GiB of data).

If you use a new Git with an old index, git status will still rehash it
the first time and correctly determine that it isn't changed, then write
the 0x80000000 to the index.  It just won't rehash it on subsequent
calls to git status.

The only incorrect behaviour (assuming that slow is not incorrect) that
I'm aware of on older Git is that git diff-files marks it as modified
when it's not.  There is no incorrect behaviour on a fixed Git.

> > +/*
> > + * Munge st_size into an unsigned int.
> > + */
> > +static unsigned int munge_st_size(off_t st_size) {
> > +	unsigned int sd_size = st_size;
> > +
> > +	/*
> > +	 * If the file is an exact multiple of 4 GiB, modify the value so it
> > +	 * doesn't get marked as racily clean (zero).
> > +	 */
> > +	if (!sd_size && st_size)
> > +		return 0x80000000;
> > +	else
> > +		return sd_size;
> > +}
> 
> This assumes typeof(sd_size) aka "unsigned int" is always 32-bit,
> which does not sound reasonable.  Reference to 4GiB, 2^32 and 2^31
> in the code and the proposed commit log message need to be qualified
> with "on a platform whose uint is 32-bit" or something, or better
> yet, phrased in a way that is agnostic to the integer size.  At
> the very least, the hardcoded 0x80000000 needs to be rethought, I
> am afraid.

unsigned int is 32-bit on every modern 32- or 64-bit platform known to
exist today.  I don't believe we support MS-DOS or systems of its era,
nor should we support 8- or 16-bit systems.  If you'd prefer "uint32_t",
I can do that.

Note that the problem is reproducible on a stock Ubuntu amd64 system, so
it is definitely a problem on all modern systems.

> Other than that, the workaround for the racy-git avoidance code does
> sound good.  I actually wonder if we should always use 1 regardless
> of integer size.

I think using 2^31 is better because it's far away from very small
values and very large values, which means that it's a hard to modify a
file which used to have its value as a multiple of 4 GiB and
accidentally make Git think it was unchanged.  Using 1 would make a
simple "echo > foo" possibly think things were unchanged in some cases,
which we should avoid.
Junio C Hamano Oct. 12, 2023, 10:11 p.m. UTC | #3
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> nor should we support 8- or 16-bit systems.  If you'd prefer "uint32_t",
> I can do that.

Going into statinfo.h and updating sd_size to uint32_t is totally
outside the scope of this fix.

> I think using 2^31 is better because it's far away from very small
> values and very large values, which means that it's a hard to modify a
> file which used to have its value as a multiple of 4 GiB and
> accidentally make Git think it was unchanged.  Using 1 would make a
> simple "echo > foo" possibly think things were unchanged in some cases,
> which we should avoid.

The reason I gave the extreme "1-byte" was to gauge how much actual
"size" we are relying on the correctness of this change.  As mtime
is giving the primary protection from false matching of cached stat
information, I do not think "echo >foo" would be a huge issue.  IOW
my stance is 1U<<31 is as good as 1U<<0, so I do not oppose to the
former, either.  But in a few years, 64-bit integers may cease to be
too odd, who knows ;-)
Jeff King Oct. 17, 2023, midnight UTC | #4
On Thu, Oct 12, 2023 at 04:09:30PM +0000, brian m. carlson wrote:

> +static unsigned int munge_st_size(off_t st_size) {
> +	unsigned int sd_size = st_size;
> +
> +	/*
> +	 * If the file is an exact multiple of 4 GiB, modify the value so it
> +	 * doesn't get marked as racily clean (zero).
> +	 */
> +	if (!sd_size && st_size)
> +		return 0x80000000;
> +	else
> +		return sd_size;
> +}

Coverity complained that the "true" side of this conditional is
unreachable, since sd_size is assigned from st_size, so the two values
cannot be both true and false. But obviously we are depending here on
the truncation of off_t to "unsigned int". I'm not sure if Coverity is
just dumb, or if it somehow has a different size for off_t.

I don't _think_ this would ever cause confusion in a real compiler, as
assignment from a larger type to a smaller has well-defined truncation,
as far as I know.

But I do wonder if an explicit "& 0xFFFFFFFF" would make it more obvious
what is happening (which would also do the right thing if in some
hypothetical platform "unsigned int" ended up larger than 32 bits).

-Peff
Jason Hatton Oct. 17, 2023, 2:49 p.m. UTC | #5
From: Jeff King <peff@peff.net>

> On Thu, Oct 12, 2023 at 04:09:30PM +0000, brian m. carlson wrote:
> 
> > +static unsigned int munge_st_size(off_t st_size) {
> > +	unsigned int sd_size = st_size;
> > +
> > +	/*
> > +	 * If the file is an exact multiple of 4 GiB, modify the value so it
> > +	 * doesn't get marked as racily clean (zero).
> > +	 */
> > +	if (!sd_size && st_size)
> > +		return 0x80000000;
> > +	else
> > +		return sd_size;
> > +}
> 
> Coverity complained that the "true" side of this conditional is
> unreachable, since sd_size is assigned from st_size, so the two values
> cannot be both true and false. But obviously we are depending here on
> the truncation of off_t to "unsigned int". I'm not sure if Coverity is
> just dumb, or if it somehow has a different size for off_t.
> 
> I don't _think_ this would ever cause confusion in a real compiler, as
> assignment from a larger type to a smaller has well-defined truncation,
> as far as I know.
> 
> But I do wonder if an explicit "& 0xFFFFFFFF" would make it more obvious
> what is happening (which would also do the right thing if in some
> hypothetical platform "unsigned int" ended up larger than 32 bits).
> 
> -Peff

I originally wrote the code this way to work exactly like the original
code with one exception: Never truncate a nonzero st_size to a zero
sd_size. The original code is here in fill_stat_data:

I was attempting to use exactly the same implicit type conversion and
types as the original.

We could probably write the below to do the same thing.

void fill_stat_data(struct stat_data *sd, struct stat *st)
{
      sd->sd_ctime.sec = (unsigned int)st->st_ctime;
      sd->sd_mtime.sec = (unsigned int)st->st_mtime;
      sd->sd_ctime.nsec = ST_CTIME_NSEC(*st);
      sd->sd_mtime.nsec = ST_MTIME_NSEC(*st);
      sd->sd_dev = st->st_dev;
      sd->sd_ino = st->st_ino;
      sd->sd_uid = st->st_uid;
      sd->sd_gid = st->st_gid;
      sd->sd_size = st->st_size;
+      if (sd->sd_size == 0 && st->st_size!= 0) {
+            sd->sd_size = 1;
+      }
}

- Jason D. Hatton
Junio C Hamano Oct. 17, 2023, 5:02 p.m. UTC | #6
Jason Hatton <jhatton@globalfinishing.com> writes:

> We could probably write the below to do the same thing.
>
> void fill_stat_data(struct stat_data *sd, struct stat *st)
> {
>       sd->sd_ctime.sec = (unsigned int)st->st_ctime;
>       sd->sd_mtime.sec = (unsigned int)st->st_mtime;
>       sd->sd_ctime.nsec = ST_CTIME_NSEC(*st);
>       sd->sd_mtime.nsec = ST_MTIME_NSEC(*st);
>       sd->sd_dev = st->st_dev;
>       sd->sd_ino = st->st_ino;
>       sd->sd_uid = st->st_uid;
>       sd->sd_gid = st->st_gid;
>       sd->sd_size = st->st_size;
> +      if (sd->sd_size == 0 && st->st_size!= 0) {
> +            sd->sd_size = 1;
> +      }
> }

The above is a fairly straight-forward inlining (except that it does
explicit comparisons with zero) of the helper function the version
of patch under discussion added, and uses 1 instead of (1<<31) as an
arbigrary nonzero number that can be used to work around the issue.

So I agree with you that it would do the same thing.  I am not
surprised if it also gets scolded by Coverity the same way, though.
brian m. carlson Oct. 18, 2023, 12:42 a.m. UTC | #7
On 2023-10-17 at 00:00:19, Jeff King wrote:
> On Thu, Oct 12, 2023 at 04:09:30PM +0000, brian m. carlson wrote:
> 
> > +static unsigned int munge_st_size(off_t st_size) {
> > +	unsigned int sd_size = st_size;
> > +
> > +	/*
> > +	 * If the file is an exact multiple of 4 GiB, modify the value so it
> > +	 * doesn't get marked as racily clean (zero).
> > +	 */
> > +	if (!sd_size && st_size)
> > +		return 0x80000000;
> > +	else
> > +		return sd_size;
> > +}
> 
> Coverity complained that the "true" side of this conditional is
> unreachable, since sd_size is assigned from st_size, so the two values
> cannot be both true and false. But obviously we are depending here on
> the truncation of off_t to "unsigned int". I'm not sure if Coverity is
> just dumb, or if it somehow has a different size for off_t.

Technically, on 32-bit machines, you can have a 32-bit off_t if you
don't compile with -D_FILE_OFFSET_BITS=64.  However, such a program is
not very useful on modern systems, since it wouldn't be able to handle
files that are 2 GiB or larger, and so everyone considers that a silly
and buggy way to compile programs.

> I don't _think_ this would ever cause confusion in a real compiler, as
> assignment from a larger type to a smaller has well-defined truncation,
> as far as I know.
> 
> But I do wonder if an explicit "& 0xFFFFFFFF" would make it more obvious
> what is happening (which would also do the right thing if in some
> hypothetical platform "unsigned int" ended up larger than 32 bits).

Such a hypothetical platform is of course allowed by the C standard, but
it's also going to run near zero real Unix programs or kernels.  I am at
this point reflecting on the prudent decision Rust made to make
almost everything an explicit bit size (e.g., u32, i64).

Nonetheless, I'll reroll this with the other things you mentioned, and
I'll switch that "unsigned int" above to "uint32_t", which I think makes
this more obvious.

I don't, however, intend to change the constant from 0x8000000 to 1,
because I think that's a poorer choice, but I'll try to explain it
better in the commit message.  (I had originally aimed to avoid editing
it as much as possible since it's not my name on the commit to avoid
Jason getting blamed unnecessarily for any mistakes I might make.)
diff mbox series

Patch

diff --git a/statinfo.c b/statinfo.c
index 17bb8966c3..9367ca099c 100644
--- a/statinfo.c
+++ b/statinfo.c
@@ -2,6 +2,22 @@ 
 #include "environment.h"
 #include "statinfo.h"
 
+/*
+ * Munge st_size into an unsigned int.
+ */
+static unsigned int munge_st_size(off_t st_size) {
+	unsigned int sd_size = st_size;
+
+	/*
+	 * If the file is an exact multiple of 4 GiB, modify the value so it
+	 * doesn't get marked as racily clean (zero).
+	 */
+	if (!sd_size && st_size)
+		return 0x80000000;
+	else
+		return sd_size;
+}
+
 void fill_stat_data(struct stat_data *sd, struct stat *st)
 {
 	sd->sd_ctime.sec = (unsigned int)st->st_ctime;
@@ -12,7 +28,7 @@  void fill_stat_data(struct stat_data *sd, struct stat *st)
 	sd->sd_ino = st->st_ino;
 	sd->sd_uid = st->st_uid;
 	sd->sd_gid = st->st_gid;
-	sd->sd_size = st->st_size;
+	sd->sd_size = munge_st_size(st->st_size);
 }
 
 int match_stat_data(const struct stat_data *sd, struct stat *st)
@@ -51,7 +67,7 @@  int match_stat_data(const struct stat_data *sd, struct stat *st)
 			changed |= INODE_CHANGED;
 #endif
 
-	if (sd->sd_size != (unsigned int) st->st_size)
+	if (sd->sd_size != munge_st_size(st->st_size))
 		changed |= DATA_CHANGED;
 
 	return changed;
diff --git a/t/t7508-status.sh b/t/t7508-status.sh
index 6928fd89f5..6c46648e11 100755
--- a/t/t7508-status.sh
+++ b/t/t7508-status.sh
@@ -1745,4 +1745,20 @@  test_expect_success 'slow status advice when core.untrackedCache true, and fsmon
 	)
 '
 
+test_expect_success EXPENSIVE 'status does not re-read unchanged 4 or 8 GiB file' '
+	(
+		mkdir large-file &&
+		cd large-file &&
+		# Files are 2 GiB, 4 GiB, and 8 GiB sparse files.
+		test-tool truncate file-a 0x080000000 &&
+		test-tool truncate file-b 0x100000000 &&
+		test-tool truncate file-c 0x200000000 &&
+		# This will be slow.
+		git add file-a file-b file-c &&
+		git commit -m "add large files" &&
+		git diff-index HEAD file-a file-b file-c >actual &&
+		test_must_be_empty actual
+	)
+'
+
 test_done