Message ID | 20190723025736.23036-1-mattst88@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | archive: Store checksum correctly | expand |
Matt Turner <mattst88@gmail.com> writes: > tar2sqfs (part of https://github.com/topics/tar2sqfs) rejects tarballs > made with git archive with the message > > invalid tar header checksum! > > tar2sqfs recomputes the tarball's checksum to verify it. Its checksum > implementation agrees with GNU tar, which contains a comment that states > > Fill in the checksum field. It's formatted differently from the > other fields: it has [6] digits, a null, then a space ... > > Correcting this allows tar2sqfs to correctly process the tarballs made > by git archive. > > Signed-off-by: Matt Turner <mattst88@gmail.com> > --- > archive-tar.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/archive-tar.c b/archive-tar.c > index 3e53aac1e6..f9a157bfd1 100644 > --- a/archive-tar.c > +++ b/archive-tar.c > @@ -215,7 +215,9 @@ static void prepare_header(struct archiver_args *args, > memcpy(header->magic, "ustar", 6); > memcpy(header->version, "00", 2); > > - xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header)); > + xsnprintf(header->chksum, sizeof(header->chksum), "%06o", ustar_header_chksum(header)); > + header->chksum[6] = '\0'; > + header->chksum[7] = ' '; Wow. The choice of %07o is as old as the very original of tar-tree implementation in our codebase, starting at ae64bbc1 ("tar-tree: Introduce write_entry()", 2006-03-25). I think the updated behaviour matches Wikipedia [*1*] where it spells out that 6 octal is followed by a NUL and a SP; it also says various implementations do not adhere to this format---perhaps they meant us ;-) Frustratingly, I couldn't find this spelled out in POSIX [*2*] X-<. The closest I found there was All other fields are leading zero-filled octal numbers using digits from the ISO/IEC 646:1991 standard IRV. Each numeric field is terminated by one or more <space> or NUL characters. which sounds as if the standard wanted to be deliberately more inclusive than what tar2sqfs is insisting on. I think a change like this would impact kernel.org folks' tarball uploading tool, but that is not a reason not to apply this patch. [References]. *1* https://en.wikipedia.org/wiki/Tar_(computing) *2* http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html
On Tue, Jul 23, 2019 at 09:49:44AM -0700, Junio C Hamano wrote: > I think a change like this would impact kernel.org folks' tarball > uploading tool, but that is not a reason not to apply this patch. That was my thought, too. Distro projects like homebrew rely on stable hashes of upstream tarballs, and they get cranky when those tarballs change. Ideally the projects they package would provide byte-stable tarballs, but many of them rely on on-the-fly tarball generation by hosting sites like GitHub. Which isn't to say we should never fix bugs in the tarballs that we produce, but it's not entirely clear to me that this _is_ a bug, and not just one tool being overly picky. -Peff
Am 23.07.19 um 18:49 schrieb Junio C Hamano: > Matt Turner <mattst88@gmail.com> writes: > >> tar2sqfs (part of https://github.com/topics/tar2sqfs) rejects tarballs >> made with git archive with the message >> >> invalid tar header checksum! >> >> tar2sqfs recomputes the tarball's checksum to verify it. Its checksum >> implementation agrees with GNU tar, which contains a comment that states >> >> Fill in the checksum field. It's formatted differently from the >> other fields: it has [6] digits, a null, then a space ... >> >> Correcting this allows tar2sqfs to correctly process the tarballs made >> by git archive. >> >> Signed-off-by: Matt Turner <mattst88@gmail.com> >> --- >> archive-tar.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/archive-tar.c b/archive-tar.c >> index 3e53aac1e6..f9a157bfd1 100644 >> --- a/archive-tar.c >> +++ b/archive-tar.c >> @@ -215,7 +215,9 @@ static void prepare_header(struct archiver_args *args, >> memcpy(header->magic, "ustar", 6); >> memcpy(header->version, "00", 2); >> >> - xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header)); >> + xsnprintf(header->chksum, sizeof(header->chksum), "%06o", ustar_header_chksum(header)); >> + header->chksum[6] = '\0'; >> + header->chksum[7] = ' '; > > Wow. The choice of %07o is as old as the very original of tar-tree > implementation in our codebase, starting at ae64bbc1 ("tar-tree: > Introduce write_entry()", 2006-03-25). Actually it's already in 731ab9ccf2 ("[PATCH] create tar archives of tree on the fly", 2005-04-28). > I think the updated behaviour matches Wikipedia [*1*] where it > spells out that 6 octal is followed by a NUL and a SP; it also says > various implementations do not adhere to this format---perhaps they > meant us ;-) OpenBSD's pax(1) does the same if I read https://github.com/openbsd/src/blob/master/bin/pax/tar.c correctly. > Frustratingly, I couldn't find this spelled out in POSIX [*2*] X-<. > The closest I found there was > > All other fields are leading zero-filled octal numbers using > digits from the ISO/IEC 646:1991 standard IRV. Each numeric > field is terminated by one or more <space> or NUL characters. > > which sounds as if the standard wanted to be deliberately more > inclusive than what tar2sqfs is insisting on. Yes, I remember following that specific paragraph when writing the code. The wording probably tries to account for the different variants invented by vendors over the decades. is_checksum_valid() in https://github.com/AgentD/squashfs-tools-ng/blob/master/lib/tar/checksum.c compares the formatted checksum byte-by-byte. That seems unnecessarily strict. Parsing and comparing the numerical value of the checksum would be more forgiving, better adhere to POSIX and might be a tiny bit quicker. René
Am 23.07.19 um 21:38 schrieb René Scharfe: > is_checksum_valid() in > https://github.com/AgentD/squashfs-tools-ng/blob/master/lib/tar/checksum.c > compares the formatted checksum byte-by-byte. That seems > unnecessarily strict. Parsing and comparing the numerical value > of the checksum would be more forgiving, better adhere to POSIX and > might be a tiny bit quicker. I mean something like the patch below. Code and text size are bigger, but it's more lenient and writes less. Untested. (Side note: I'm a bit surprised that GCC 8.3 adds the eight spaces one by one in the middle loop with -O2..) --- lib/tar/checksum.c | 31 ++++++++++++++++++++++--------- 1 file changed, 22 insertions(+), 9 deletions(-) diff --git a/lib/tar/checksum.c b/lib/tar/checksum.c index a2a101a..af94ab4 100644 --- a/lib/tar/checksum.c +++ b/lib/tar/checksum.c @@ -1,15 +1,27 @@ /* SPDX-License-Identifier: GPL-3.0-or-later */ #include "internal.h" -void update_checksum(tar_header_t *hdr) +static unsigned int get_checksum(const tar_header_t *hdr) { + const unsigned char *header_start = (const unsigned char *)hdr; + const unsigned char *chksum_start = (const unsigned char *)hdr->chksum; + const unsigned char *header_end = header_start + sizeof(*hdr); + const unsigned char *chksum_end = chksum_start + sizeof(hdr->chksum); + const unsigned char *p; unsigned int chksum = 0; - size_t i; - memset(hdr->chksum, ' ', sizeof(hdr->chksum)); + for (p = header_start; p < chksum_start; p++) + chksum += *p; + for (; p < chksum_end; p++) + chksum += ' '; + for (; p < header_end; p++) + chksum += *p; + return chksum; +} - for (i = 0; i < sizeof(*hdr); ++i) - chksum += ((unsigned char *)hdr)[i]; +void update_checksum(tar_header_t *hdr) +{ + unsigned int chksum = get_checksum(hdr); sprintf(hdr->chksum, "%06o", chksum); hdr->chksum[6] = '\0'; @@ -18,9 +30,10 @@ void update_checksum(tar_header_t *hdr) bool is_checksum_valid(const tar_header_t *hdr) { - tar_header_t copy; + unsigned int calculated_chksum = get_checksum(hdr); + uint64_t read_chksum; - memcpy(©, hdr, sizeof(*hdr)); - update_checksum(©); - return memcmp(hdr, ©, sizeof(*hdr)) == 0; + if (read_octal(hdr->chksum, sizeof(hdr->chksum), &read_chksum)) + return 0; + return read_chksum == calculated_chksum; } -- 2.22.0
René Scharfe <l.s.r@web.de> writes: >> Wow. The choice of %07o is as old as the very original of tar-tree >> implementation in our codebase, starting at ae64bbc1 ("tar-tree: >> Introduce write_entry()", 2006-03-25). > > Actually it's already in 731ab9ccf2 ("[PATCH] create tar archives of > tree on the fly", 2005-04-28). Yup, after viewing "git show ae64bbc1" I found out the commit added a new helper to do %07o without touching the existing one that did the same. Problem with relying on "git blame" too much X-<. >> I think the updated behaviour matches Wikipedia [*1*] where it >> spells out that 6 octal is followed by a NUL and a SP; it also says >> various implementations do not adhere to this format---perhaps they >> meant us ;-) > > OpenBSD's pax(1) does the same if I read > https://github.com/openbsd/src/blob/master/bin/pax/tar.c correctly. What's more interesting is that their verifier in tar_id() compares the ulong value read from textual checksum with the ulong value computed. I agree with you that it would be the more robust way than what is done by squshfs tools (ng).
Hi, the implementation of tar2sqfs was primarily based on `man 5 tar`[0], Wikipedia[1] and trial/error with archives generated by GNU tar and samples from [2] which I also integrated as test cases. On 7/23/19 10:08 PM, René Scharfe wrote: > Am 23.07.19 um 21:38 schrieb René Scharfe: >> is_checksum_valid() in >> https://github.com/AgentD/squashfs-tools-ng/blob/master/lib/tar/checksum.c >> compares the formatted checksum byte-by-byte. That seems >> unnecessarily strict. Parsing and comparing the numerical value >> of the checksum would be more forgiving, better adhere to POSIX and >> might be a tiny bit quicker. > I agree with that. The current code was probably the simplest way to move forward with existing code after implementing sqfs2tar and it stayed that way, since it worked with all archives I tested it with. As for every file format parser, tar2sqfs should of course be as tolerant as possible in what it accepts. > I mean something like the patch below. Code and text size are bigger, > but it's more lenient and writes less. Untested. > I applied and tested the patch. It looks similar to the solution I came up with and the test cases still pass. Also, on my system tar2sqfs now accepts git-archive tar balls which it didn't before applying the patch, so I would go with this. Thanks, David [0] https://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5&manpath=FreeBSD+12.0-RELEASE&arch=default&format=html [1] https://en.wikipedia.org/wiki/Tar_(computing)#File_format [2] https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html
(cc: René Scharfe, "git archive" expert) Matt Turner wrote: > tar2sqfs (part of https://github.com/topics/tar2sqfs) rejects tarballs > made with git archive with the message > > invalid tar header checksum! > > tar2sqfs recomputes the tarball's checksum to verify it. Its checksum > implementation agrees with GNU tar, which contains a comment that states > > Fill in the checksum field. It's formatted differently from the > other fields: it has [6] digits, a null, then a space ... > > Correcting this allows tar2sqfs to correctly process the tarballs made > by git archive. > > Signed-off-by: Matt Turner <mattst88@gmail.com> > --- > archive-tar.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) Nice. Is this something that can be covered in tests as well? (See t500* for existing "git archive" tests, and see test_lazy_prereq in case you'd like the test to use an external tool like tar2sqfs that not all users may have.) Thanks, Jonathan (patch left unsnipped for reference) > diff --git a/archive-tar.c b/archive-tar.c > index 3e53aac1e6..f9a157bfd1 100644 > --- a/archive-tar.c > +++ b/archive-tar.c > @@ -215,7 +215,9 @@ static void prepare_header(struct archiver_args *args, > memcpy(header->magic, "ustar", 6); > memcpy(header->version, "00", 2); > > - xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header)); > + xsnprintf(header->chksum, sizeof(header->chksum), "%06o", ustar_header_chksum(header)); > + header->chksum[6] = '\0'; > + header->chksum[7] = ' '; > } > > static void write_extended_header(struct archiver_args *args,
diff --git a/archive-tar.c b/archive-tar.c index 3e53aac1e6..f9a157bfd1 100644 --- a/archive-tar.c +++ b/archive-tar.c @@ -215,7 +215,9 @@ static void prepare_header(struct archiver_args *args, memcpy(header->magic, "ustar", 6); memcpy(header->version, "00", 2); - xsnprintf(header->chksum, sizeof(header->chksum), "%07o", ustar_header_chksum(header)); + xsnprintf(header->chksum, sizeof(header->chksum), "%06o", ustar_header_chksum(header)); + header->chksum[6] = '\0'; + header->chksum[7] = ' '; } static void write_extended_header(struct archiver_args *args,
tar2sqfs (part of https://github.com/topics/tar2sqfs) rejects tarballs made with git archive with the message invalid tar header checksum! tar2sqfs recomputes the tarball's checksum to verify it. Its checksum implementation agrees with GNU tar, which contains a comment that states Fill in the checksum field. It's formatted differently from the other fields: it has [6] digits, a null, then a space ... Correcting this allows tar2sqfs to correctly process the tarballs made by git archive. Signed-off-by: Matt Turner <mattst88@gmail.com> --- archive-tar.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)