diff mbox series

archive: Store checksum correctly

Message ID 20190723025736.23036-1-mattst88@gmail.com (mailing list archive)
State New, archived
Headers show
Series archive: Store checksum correctly | expand

Commit Message

Matt Turner July 23, 2019, 2:57 a.m. UTC
tar2sqfs (part of https://github.com/topics/tar2sqfs) rejects tarballs
made with git archive with the message

    invalid tar header checksum!

tar2sqfs recomputes the tarball's checksum to verify it. Its checksum
implementation agrees with GNU tar, which contains a comment that states

    Fill in the checksum field.  It's formatted differently from the
    other fields: it has [6] digits, a null, then a space ...

Correcting this allows tar2sqfs to correctly process the tarballs made
by git archive.

Signed-off-by: Matt Turner <mattst88@gmail.com>
---
 archive-tar.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Junio C Hamano July 23, 2019, 4:49 p.m. UTC | #1
Matt Turner <mattst88@gmail.com> writes:

> tar2sqfs (part of https://github.com/topics/tar2sqfs) rejects tarballs
> made with git archive with the message
>
>     invalid tar header checksum!
>
> tar2sqfs recomputes the tarball's checksum to verify it. Its checksum
> implementation agrees with GNU tar, which contains a comment that states
>
>     Fill in the checksum field.  It's formatted differently from the
>     other fields: it has [6] digits, a null, then a space ...
>
> Correcting this allows tar2sqfs to correctly process the tarballs made
> by git archive.
>
> Signed-off-by: Matt Turner <mattst88@gmail.com>
> ---
>  archive-tar.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/archive-tar.c b/archive-tar.c
> index 3e53aac1e6..f9a157bfd1 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -215,7 +215,9 @@ static void prepare_header(struct archiver_args *args,
>  	memcpy(header->magic, "ustar", 6);
>  	memcpy(header->version, "00", 2);
>  
> -	xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header));
> +	xsnprintf(header->chksum, sizeof(header->chksum), "%06o", ustar_header_chksum(header));
> +	header->chksum[6] = '\0';
> +	header->chksum[7] = ' ';

Wow.  The choice of %07o is as old as the very original of tar-tree
implementation in our codebase, starting at ae64bbc1 ("tar-tree:
Introduce write_entry()", 2006-03-25).

I think the updated behaviour matches Wikipedia [*1*] where it
spells out that 6 octal is followed by a NUL and a SP; it also says
various implementations do not adhere to this format---perhaps they
meant us ;-)

Frustratingly, I couldn't find this spelled out in POSIX [*2*] X-<.
The closest I found there was

    All other fields are leading zero-filled octal numbers using
    digits from the ISO/IEC 646:1991 standard IRV. Each numeric
    field is terminated by one or more <space> or NUL characters.

which sounds as if the standard wanted to be deliberately more
inclusive than what tar2sqfs is insisting on.

I think a change like this would impact kernel.org folks' tarball
uploading tool, but that is not a reason not to apply this patch.



[References].

*1* https://en.wikipedia.org/wiki/Tar_(computing)
*2* http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html
Jeff King July 23, 2019, 7:31 p.m. UTC | #2
On Tue, Jul 23, 2019 at 09:49:44AM -0700, Junio C Hamano wrote:

> I think a change like this would impact kernel.org folks' tarball
> uploading tool, but that is not a reason not to apply this patch.

That was my thought, too. Distro projects like homebrew rely on
stable hashes of upstream tarballs, and they get cranky when those
tarballs change. Ideally the projects they package would provide
byte-stable tarballs, but many of them rely on on-the-fly tarball
generation by hosting sites like GitHub.

Which isn't to say we should never fix bugs in the tarballs that we
produce, but it's not entirely clear to me that this _is_ a bug, and not
just one tool being overly picky.

-Peff
René Scharfe July 23, 2019, 7:38 p.m. UTC | #3
Am 23.07.19 um 18:49 schrieb Junio C Hamano:
> Matt Turner <mattst88@gmail.com> writes:
>
>> tar2sqfs (part of https://github.com/topics/tar2sqfs) rejects tarballs
>> made with git archive with the message
>>
>>     invalid tar header checksum!
>>
>> tar2sqfs recomputes the tarball's checksum to verify it. Its checksum
>> implementation agrees with GNU tar, which contains a comment that states
>>
>>     Fill in the checksum field.  It's formatted differently from the
>>     other fields: it has [6] digits, a null, then a space ...
>>
>> Correcting this allows tar2sqfs to correctly process the tarballs made
>> by git archive.
>>
>> Signed-off-by: Matt Turner <mattst88@gmail.com>
>> ---
>>  archive-tar.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/archive-tar.c b/archive-tar.c
>> index 3e53aac1e6..f9a157bfd1 100644
>> --- a/archive-tar.c
>> +++ b/archive-tar.c
>> @@ -215,7 +215,9 @@ static void prepare_header(struct archiver_args *args,
>>  	memcpy(header->magic, "ustar", 6);
>>  	memcpy(header->version, "00", 2);
>>
>> -	xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header));
>> +	xsnprintf(header->chksum, sizeof(header->chksum), "%06o", ustar_header_chksum(header));
>> +	header->chksum[6] = '\0';
>> +	header->chksum[7] = ' ';
>
> Wow.  The choice of %07o is as old as the very original of tar-tree
> implementation in our codebase, starting at ae64bbc1 ("tar-tree:
> Introduce write_entry()", 2006-03-25).

Actually it's already in 731ab9ccf2 ("[PATCH] create tar archives of
tree on the fly", 2005-04-28).

> I think the updated behaviour matches Wikipedia [*1*] where it
> spells out that 6 octal is followed by a NUL and a SP; it also says
> various implementations do not adhere to this format---perhaps they
> meant us ;-)

OpenBSD's pax(1) does the same if I read
https://github.com/openbsd/src/blob/master/bin/pax/tar.c correctly.

> Frustratingly, I couldn't find this spelled out in POSIX [*2*] X-<.
> The closest I found there was
>
>     All other fields are leading zero-filled octal numbers using
>     digits from the ISO/IEC 646:1991 standard IRV. Each numeric
>     field is terminated by one or more <space> or NUL characters.
>
> which sounds as if the standard wanted to be deliberately more
> inclusive than what tar2sqfs is insisting on.

Yes, I remember following that specific paragraph when writing the
code.  The wording probably tries to account for the different
variants invented by vendors over the decades.

is_checksum_valid() in
https://github.com/AgentD/squashfs-tools-ng/blob/master/lib/tar/checksum.c
compares the formatted checksum byte-by-byte.  That seems
unnecessarily strict.  Parsing and comparing the numerical value
of the checksum would be more forgiving, better adhere to POSIX and
might be a tiny bit quicker.

René
René Scharfe July 23, 2019, 8:08 p.m. UTC | #4
Am 23.07.19 um 21:38 schrieb René Scharfe:
> is_checksum_valid() in
> https://github.com/AgentD/squashfs-tools-ng/blob/master/lib/tar/checksum.c
> compares the formatted checksum byte-by-byte.  That seems
> unnecessarily strict.  Parsing and comparing the numerical value
> of the checksum would be more forgiving, better adhere to POSIX and
> might be a tiny bit quicker.

I mean something like the patch below.  Code and text size are bigger,
but it's more lenient and writes less.  Untested.

(Side note: I'm a bit surprised that GCC 8.3 adds the eight spaces one
 by one in the middle loop with -O2..)

---
 lib/tar/checksum.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/lib/tar/checksum.c b/lib/tar/checksum.c
index a2a101a..af94ab4 100644
--- a/lib/tar/checksum.c
+++ b/lib/tar/checksum.c
@@ -1,15 +1,27 @@
 /* SPDX-License-Identifier: GPL-3.0-or-later */
 #include "internal.h"

-void update_checksum(tar_header_t *hdr)
+static unsigned int get_checksum(const tar_header_t *hdr)
 {
+	const unsigned char *header_start = (const unsigned char *)hdr;
+	const unsigned char *chksum_start = (const unsigned char *)hdr->chksum;
+	const unsigned char *header_end = header_start + sizeof(*hdr);
+	const unsigned char *chksum_end = chksum_start + sizeof(hdr->chksum);
+	const unsigned char *p;
 	unsigned int chksum = 0;
-	size_t i;

-	memset(hdr->chksum, ' ', sizeof(hdr->chksum));
+	for (p = header_start; p < chksum_start; p++)
+		chksum += *p;
+	for (; p < chksum_end; p++)
+		chksum += ' ';
+	for (; p < header_end; p++)
+		chksum += *p;
+	return chksum;
+}

-	for (i = 0; i < sizeof(*hdr); ++i)
-		chksum += ((unsigned char *)hdr)[i];
+void update_checksum(tar_header_t *hdr)
+{
+	unsigned int chksum = get_checksum(hdr);

 	sprintf(hdr->chksum, "%06o", chksum);
 	hdr->chksum[6] = '\0';
@@ -18,9 +30,10 @@ void update_checksum(tar_header_t *hdr)

 bool is_checksum_valid(const tar_header_t *hdr)
 {
-	tar_header_t copy;
+	unsigned int calculated_chksum = get_checksum(hdr);
+	uint64_t read_chksum;

-	memcpy(&copy, hdr, sizeof(*hdr));
-	update_checksum(&copy);
-	return memcmp(hdr, &copy, sizeof(*hdr)) == 0;
+	if (read_octal(hdr->chksum, sizeof(hdr->chksum), &read_chksum))
+		return 0;
+	return read_chksum == calculated_chksum;
 }
--
2.22.0
Junio C Hamano July 23, 2019, 9:21 p.m. UTC | #5
René Scharfe <l.s.r@web.de> writes:

>> Wow.  The choice of %07o is as old as the very original of tar-tree
>> implementation in our codebase, starting at ae64bbc1 ("tar-tree:
>> Introduce write_entry()", 2006-03-25).
>
> Actually it's already in 731ab9ccf2 ("[PATCH] create tar archives of
> tree on the fly", 2005-04-28).

Yup, after viewing "git show ae64bbc1" I found out the commit added
a new helper to do %07o without touching the existing one that did
the same.  Problem with relying on "git blame" too much X-<.

>> I think the updated behaviour matches Wikipedia [*1*] where it
>> spells out that 6 octal is followed by a NUL and a SP; it also says
>> various implementations do not adhere to this format---perhaps they
>> meant us ;-)
>
> OpenBSD's pax(1) does the same if I read
> https://github.com/openbsd/src/blob/master/bin/pax/tar.c correctly.

What's more interesting is that their verifier in tar_id() compares
the ulong value read from textual checksum with the ulong value
computed.  I agree with you that it would be the more robust way
than what is done by squshfs tools (ng).
David Oberhollenzer July 23, 2019, 9:34 p.m. UTC | #6
Hi,

the implementation of tar2sqfs was primarily based on `man 5 tar`[0],
Wikipedia[1] and trial/error with archives generated by GNU tar and
samples from [2] which I also integrated as test cases.

On 7/23/19 10:08 PM, René Scharfe wrote:
> Am 23.07.19 um 21:38 schrieb René Scharfe:
>> is_checksum_valid() in
>> https://github.com/AgentD/squashfs-tools-ng/blob/master/lib/tar/checksum.c
>> compares the formatted checksum byte-by-byte.  That seems
>> unnecessarily strict.  Parsing and comparing the numerical value
>> of the checksum would be more forgiving, better adhere to POSIX and
>> might be a tiny bit quicker.
>

I agree with that. The current code was probably the simplest way to
move forward with existing code after implementing sqfs2tar and it stayed
that way, since it worked with all archives I tested it with.

As for every file format parser, tar2sqfs should of course be as tolerant
as possible in what it accepts.

> I mean something like the patch below.  Code and text size are bigger,
> but it's more lenient and writes less.  Untested.
> 
I applied and tested the patch. It looks similar to the solution I came up
with and the test cases still pass. Also, on my system tar2sqfs now accepts
git-archive tar balls which it didn't before applying the patch, so I
would go with this.

Thanks,

David


[0] https://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5&manpath=FreeBSD+12.0-RELEASE&arch=default&format=html
[1] https://en.wikipedia.org/wiki/Tar_(computing)#File_format
[2] https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html
Jonathan Nieder July 24, 2019, 1:04 a.m. UTC | #7
(cc: René Scharfe, "git archive" expert)
Matt Turner wrote:

> tar2sqfs (part of https://github.com/topics/tar2sqfs) rejects tarballs
> made with git archive with the message
>
>     invalid tar header checksum!
>
> tar2sqfs recomputes the tarball's checksum to verify it. Its checksum
> implementation agrees with GNU tar, which contains a comment that states
>
>     Fill in the checksum field.  It's formatted differently from the
>     other fields: it has [6] digits, a null, then a space ...
>
> Correcting this allows tar2sqfs to correctly process the tarballs made
> by git archive.
>
> Signed-off-by: Matt Turner <mattst88@gmail.com>
> ---
>  archive-tar.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Nice.  Is this something that can be covered in tests as well?  (See
t500* for existing "git archive" tests, and see test_lazy_prereq in case
you'd like the test to use an external tool like tar2sqfs that not all
users may have.)

Thanks,
Jonathan

(patch left unsnipped for reference)

> diff --git a/archive-tar.c b/archive-tar.c
> index 3e53aac1e6..f9a157bfd1 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -215,7 +215,9 @@ static void prepare_header(struct archiver_args *args,
>  	memcpy(header->magic, "ustar", 6);
>  	memcpy(header->version, "00", 2);
>  
> -	xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header));
> +	xsnprintf(header->chksum, sizeof(header->chksum), "%06o", ustar_header_chksum(header));
> +	header->chksum[6] = '\0';
> +	header->chksum[7] = ' ';
>  }
>  
>  static void write_extended_header(struct archiver_args *args,
diff mbox series

Patch

diff --git a/archive-tar.c b/archive-tar.c
index 3e53aac1e6..f9a157bfd1 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -215,7 +215,9 @@  static void prepare_header(struct archiver_args *args,
 	memcpy(header->magic, "ustar", 6);
 	memcpy(header->version, "00", 2);
 
-	xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header));
+	xsnprintf(header->chksum, sizeof(header->chksum), "%06o", ustar_header_chksum(header));
+	header->chksum[6] = '\0';
+	header->chksum[7] = ' ';
 }
 
 static void write_extended_header(struct archiver_args *args,