[RFC,1/1] Document a fixed tar format for interoperability

Message ID 20230205221728.4179674-2-sandals@crustytoothpaste.net (mailing list archive)
State New, archived
Series Canonical tar format for Git

Commit Message

brian m. carlson Feb. 5, 2023, 10:17 p.m. UTC
Right now, many people wish to have archives which are consistent among
Git versions.  That is not something we currently offer, but users often
rely on the fact that our tar format changes rarely and assume it will
never change, which has caused lots of problems with various sites in
the past.

Instead of letting this go on indefinitely, let's explicitly document a
versioned canonical tar format which is completely reproducible and
which we guarantee will be permanently stable.  This format is more
rigid than the current tar format, but it produces identical results for
identical trees, regardless of hash algorithm, and is easy to implement
in other tools.  This is beneficial because lots of people want fixed,
reproducible archives, and there's little reason to duplicate work.

This format, like the existing format, is actually a pax format archive,
which is an extension to the ustar (Unix standard tar) format.  This
format was documented by POSIX in 2001 and is well understood by most
modern tar implementations, including GNU tar and libarchive, which
covers the versions of tar used on most major operating systems,
including Windows, Linux, macOS, and most BSDs.

The format in this document does mandate a pax header for each file,
which slightly increases the size of the archive.  However, to properly
embed timestamps, GNU tar and libarchive's tar also do this when
generating pax archives, and because the data is highly redundant, it
will compress extremely well.  A comparison between the two approaches
using GNU tar and libarchive's tar on Git's working tree shows that with
default gzip compression, the increase in size is about 1.2%, which is
fine for almost every use case.  In return, we get a substantially
simpler (and thus, likely, more correct) implementation which is much
easier to test.
---
 Documentation/technical/tarball.txt | 234 ++++++++++++++++++++++++++++
 1 file changed, 234 insertions(+)
 create mode 100644 Documentation/technical/tarball.txt

Comments

Junio C Hamano Feb. 6, 2023, 9:08 p.m. UTC | #1
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> +Overview
> +--------
> +
> +Many people find it convenient to have tar archives that are bit-for-bit
> +identical between versions.  This can be valuable to validate that an archive
> +has not changed using a cryptographic hash without needing to store the archive
> +itself.
> +
> +However, up to now, Git has not guaranteed a consistent format, although people
> +often make the assumption that Git's archives will always be bit-for-bit
> +identical.  This has led to several notable problems with various forges.
> +
> +This document proposes a canonical tar format based on the POSIX pax format that
> +is bit-for-bit identical.  It is referred to as ctar-v1 (canonical tar version 1).

"is identical to what"?  Ditto for the one in the previous
paragraph.  The first paragraph is better in that there is "between
versions", even though it would be easier to grok if we made it more
clear that we are talking about versions of the software that is
used to create the archive, not the version of contents being
archived.

Our goal is that serializing the same tree object or the same commit
object yields a bit-for-bit identical result, no matter which
version of Git is used, and no matter what platform the Git used to
create the archive was built on.  Mentioning in the description both
what we take an archive out of (i.e. tree or commit) and that we can
use different versions of Git to create archives would make it
easier to grok.

> +Goals and Rationale
> +-------------------
> +
> +The goals for this format are that it is first and foremost reproducible, that
> +identical trees produce identical results, that it is simple and easy to
> +implement correctly, and that it is useful in general.  While we don't consider
> +functionality needs beyond Git's at the moment (such as hardlinks, xattrs, or
> +sparse files), there is intense interest in reproducible builds, and so it makes
> +sense to design something that can see general use for software interchange.

Perfect.

> +Because the goal is strict reproducibility, this format doesn't honor
> +`tar.umask` or other options that can produce different output.  It serializes
> +all timestamps as the Epoch, which produces identical results whether the tree
> +is serialized as a tree, commit, or tag.  This is consistent with the behaviour
> +of some other tar serializers, including the default for modern Rust crates, and
> +is not believed to pose any interoperability problems.

> +Object IDs are not included in this version of the format because this produces
> +non-identical data when identical data is serialized with different hash
> +algorithms.

Declaring that we'll always peel a tag or a commit down to a tree is
one sure way to avoid having to worry about object name hashes, but
aren't we discarding too much utility by doing so?

This is probably debatable.  The commit object name embedded in the
extended header of an archive makes it trivial to identify what
version the archive _claims_ to have been taken from (you could also
embed it in the filename that stores archive, but the use of the
embedded metainfo makes it more robust against file names).  And
running "git archive" twice, with different versions of Git on
different architectures, should be reproducible as long as both
invokers expressed their desire to see the commit object name in the
archive by passing the commit, not its tree, to the command, and
they are using the same hash algorithm.

In the world where multiple hash functions are in use, a commit that
is being archived may have one or two "object names", but it should
not be hard to use one extended header item per each to store one or
both, I would imagine.

Having said all that, I think stripping the commit object name (or
tags) is a better design.  Imagine that I see I created a tarball
earlier and published its hash, but later lost the tarball.  By not
allowing any commit object name in the archive, it would force me to
somehow name the tarball in such a way that I can tell which commit
I used to create it, e.g. "git-e83c516331.tar".  Other people can
notice the filename and without having seen the bytes in it, they
can try running "git archive e83c516331" in their repository and see
the output matches the hash I published earlier.  Having a commit or
tag embedded in the archive would make it harder to do this kind of
thing.

By the way, other potentially interesting points are:

 - Do we want to ignore "export-subst" for stability?

 - "git archive" can be invoked with pathspec to archive only a
   subset of paths.

 - "git archive" could be extended to include submodule trees
   recursively in the same output.

The latter two are trivial to support, but we need to make sure that
we do not screw up the ordering of paths in the output, especially
for the last one, when we add it.

> +Introduction to the Underlying Format
> +-------------------------------------
> ...
> +A global extended header sets metadata for the entire file, and a per-file
> +extended header applies to only the to which it corresponds.  A per-file

"only the to which" -> "only the file to which"

> +extended header overrides any data specified in the global extended header, and
> +all extended headers override any data stored in a normal ustar per-file header
> +block.

> +While pax extensions are widely supported by most modern versions of tar
> +(including versions on Windows and all major open-source OSes), some older
> +archivers and non-tar implementations which do not understand them typically
> +extract the extended headers as regular files.  Thus, it's helpful to have these
> +entries have reasonable permissions and unique names.

Surely, and to make things reproducible, they shouldn't just be
reasonable and unique.  They should be exactly as we define in the
specification.

> +General Architecture
> +--------------------
> +
> +All canonical tar archives are valid POSIX pax archives as that format is
> +defined in POSIX.1-2017.  Every archive will have a global header indicating the
> +version and format and what types of data are valid in the archive.
> +
> +Every file serialized in the archive is serialized in lexicographical order by
> +its bytes.  A directory is always serialized before its contents, and a

"by its bytes" -> "by the bytes in its filename" or something?
Surely we do not sort by contents ;-)
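
To make the ordering question concrete, here is a minimal Python sketch of one
reading of the rule: entries are sorted by the raw bytes of their full path
names, with directories carrying no trailing slash.  The helper name
`ctar_order` and the sample paths are made up for illustration; this is not
taken from the patch.

```python
def ctar_order(paths):
    """Sort archive member paths by the raw bytes of their names."""
    # Paths are compared as UTF-8 byte strings, never by file contents.
    # A directory has no trailing slash, so it is a strict prefix of
    # everything below it and therefore sorts before its contents.
    return sorted(paths, key=lambda p: p.encode("utf-8"))


entries = ["src/main.c", "src", "README", "src-old", "src.txt"]
print(ctar_order(entries))
# ['README', 'src', 'src-old', 'src.txt', 'src/main.c']
```

Under this reading a directory precedes its contents, but siblings such as
`src-old` and `src.txt` can land between `src` and `src/main.c`, because `-`
and `.` sort before `/`.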

> +directory is never serialized with a trailing slash.  If a system uses a Unicode
> +encoding other than UTF-8, it encodes filenames as UTF-8.

This is a bit hard to grok.  Do you mean there may be a UTF-16 system
where the paths in our tree objects are recorded in UTF-8, but
"git checkout" of the tree may result in files named in the native
encoding of that system, i.e. UTF-16 not UTF-8?  And even on such a
system, running "git archive" would record paths in the archive in
UTF-8 (i.e. the same as what was in the tree object)?  Or do you
mean something stronger, like on a Latin-1 system with a Latin-1
project that used Latin-1 pathnames even in the tree objects,
when "git archive" produces an archive, the paths in it shall be
transcoded from the original Latin-1 pathnames to UTF-8?

> +Each file shall contain a pax extended header record.
> +
> +It is possible to encode some extended headers in multiple ways because the
> +length in the header encodes its own length.  For example, in cases where the
> +length value can be encoded as either 99 or 100, both can lead to identical
> +header data.  The shortest possible encoding must always be used.

;-)

> +In any event where multiple encodings are possible, the shortest and, if there
> +is still confusion, lexicographically first (by byte value) must always be used.

;-)
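
For readers who have not run into this before: a pax record is
"LEN key=value\n", where LEN counts the whole record including its own
digits.  A minimal Python sketch of picking the shortest self-consistent
length follows; the function name `pax_record` is invented for this
illustration and is not from the patch.

```python
def pax_record(key: bytes, value: bytes) -> bytes:
    """Encode one pax extended header record as b"LEN key=value\\n"."""
    body = b" " + key + b"=" + value + b"\n"
    digits = 1
    while True:
        # LEN must equal its own digit count plus the rest of the record.
        total = len(body) + digits
        if len(str(total)) == digits:
            # Trying the smallest digit count first yields the shortest
            # valid record.
            return str(total).encode("ascii") + body
        digits += 1


print(pax_record(b"mtime", b"0"))  # b'11 mtime=0\n'

# A 97-byte body is self-consistent with both "99" and "100" as the
# length; the loop above settles on 99.
ambiguous = pax_record(b"path", b"a" * 90)
print(len(ambiguous))  # 99
```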

> +All unspecified padding is filled with NUL bytes.

Perhaps we should replace the casual mentions of "zero"s we saw
earlier with "NUL bytes", too.

> +Version Number
> +--------------
> +
> +The version number for this version is `ctar-v1`.
> +
> +Extended Headers
> +----------------
> +
> +Global Extended Header
> +~~~~~~~~~~~~~~~~~~~~~~
> +
> +The global extended header (record `g`) shall contain one header:
> +`CTAR.version`, which contains the version number specified above.
> +
> +The contents of the ustar header for the global extended header are as below,
> +except that the `name` field contains `pax_global_header`.

"as below" meaning...?  The same as what is listed in "Per-File
Extended Header"?  There is no `name` field listed there, though.

> +Per-File Extended Header
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Each file has a per-file extended header.
> +
> +The following per-file extended header fields are included:
> +
> +|===
> +| Field Name   | When Present  | Value
> +
> +| `atime`      | always        | `0`
> +| `mtime`      | always        | `0`
> +| `size`       | always        | size of the data in bytes
> +| `path`       | always        | full path name of the file

These are length-prefixed data, so we do not have to worry about
overly long pathnames or symlinks?

> +| `uid`        | always        | `0`
> +| `gid`        | always        | `0`
> +| `uname`      | always        | `root`
> +| `gname`      | always        | `root`
> +| `linkpath`   | symbolic link | full path name of the link destination
> +| `hdrcharset` | binary path   | `BINARY`
> +
> +Note that the `hdrcharset` entry appears if and only if the `path` or, if
> +present, the `linkpath`, header contains a non-UTF-8 encoded string.  Because
> +Git does not store the encoding of file names, it has no way of knowing whether
> +a file name which could be valid UTF-8 actually is, but for the purposes of
> +compatibility, such file names are assumed to be UTF-8 and are not declared as
> +binary.  This improves portability to systems which always use Unicode.
> +However, we because we do not know for certain whether these values are UTF-8,

"we because" -> "because"

> +we avoid explicitly declaring them as such and rely on the default archiver
> +behavior, which may be more sensible.

So, do we or do we not store hdrcharset?  The producing Git does not
know whether the pathnames stored in the tree it is asked to archive
are in UTF-8, so it assumes everything is in UTF-8 and hence
does not see the need to add hdrcharset?

> +The `path` field contains the full path name without a leading slash or leading
> +`.` or `..` component.  The path never contains a directory component which is
> +`.` or `..`.
> +
> +The `linkpath` field contains the full symbolic link destination.  `.` and `..`
> +components are permitted if the destination contains those values.

In other words, we just store the contents of the blob that
represents the symbolic link there?  I wonder if we do anything
special if a blob, that is pointed at in an entry in a tree whose
mode bits are 120000, has NUL in it (should we teach fsck to flag
it, for example)?

> +In all cases, path names use `/` as the directory separator.
> +
> +The reason for always including most of the entries in the archive is to aid in
> +implementing and testing correct serialization.  If these entries are always
> +present, then this process becomes much simpler, whereas if they are only
> +included as needed, then errors are more likely.

The order of entries needs to be specified when we aim for
bit-for-bit reproducibility, no?

> +The `name` field of the ustar header of this extended header is `paxheader.%d`,
> +where `%d` represents the shortest-form decimal integer encoding the index of
> +this file in the archive, starting with 0.  All files, directories, and links of
> +whatever kind are counted, but extended headers are not.

> +Serialization of Extended Headers
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +When serializing the header block for an extended header, the following values

"the header block" -> "the ustar header block" to match the next
section, probably.

> +should be used.  Note that all text fields are be NUL-padded on the right when
> +they do not fill the field, and all octal fields are left-padded with zeros such
> +that they fill the field with a single trailing NUL.  An empty field contains
> +only NULs.
> +
> +|===
> +| Field Name | Value
> +
> +| `name`     | `pax_global_header` (global) or `paxheader.%d` (per-file) (see above)
> +| `mode`     | `0640`
> +| `uid`      | `0`
> +| `gid`      | `0`
> +| `size`     | the size of the extended header in bytes
> +| `mtime`    | `0` (the Epoch)
> +| `chksum`   | as specified in the standard
> +| `typeflag` | `g` (global) or `x` (per-file)
> +| `linkname` | empty
> +| `magic`    | as specified in the standard
> +| `version`  | as specified in the standard
> +| `uname`    | `root`
> +| `gname`    | `root`
> +| `devmajor` | `0`
> +| `devminor` | `0`
> +| `prefix`   | empty
> +|===

These are bare-bones header fields, not extended headers.  Do we want
to refer to some canonical sources so that readers understand that,
unlike the extended headers, we are talking about fixed-length fields?
The description above talks about "padding", but that of course
applies to fixed-width columns.
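
As an illustration of the fixed-width layout, here is a rough Python sketch
that fills the 512-byte ustar block for an extended header using the values
from the table above.  The field widths are the standard ustar ones; the
helper names (`octal`, `text`, `xhdr_block`) are hypothetical, and writing
`chksum` as seven octal digits plus a NUL is an assumption, since the draft
only says "as specified in the standard".

```python
import tarfile


def octal(value: int, width: int) -> bytes:
    # Octal digits left-padded with zeros, one trailing NUL.
    return f"{value:0{width - 1}o}".encode("ascii") + b"\x00"


def text(value: bytes, width: int) -> bytes:
    # Text fields are NUL-padded on the right.
    if len(value) > width:
        raise ValueError("value does not fit in field")
    return value.ljust(width, b"\x00")


def xhdr_block(name: bytes, size: int, typeflag: bytes) -> bytes:
    fields = [
        text(name, 100),    # name
        octal(0o640, 8),    # mode
        octal(0, 8),        # uid
        octal(0, 8),        # gid
        octal(size, 12),    # size of the extended header data
        octal(0, 12),       # mtime (the Epoch)
        b" " * 8,           # chksum counts as spaces while summing
        typeflag,           # b"g" (global) or b"x" (per-file)
        text(b"", 100),     # linkname
        b"ustar\x00",       # magic
        b"00",              # version
        text(b"root", 32),  # uname
        text(b"root", 32),  # gname
        octal(0, 8),        # devmajor
        octal(0, 8),        # devminor
        text(b"", 155),     # prefix
    ]
    block = b"".join(fields).ljust(512, b"\x00")
    chksum = octal(sum(block), 8)  # assumed encoding, see above
    return block[:148] + chksum + block[156:]


# Round-trip through Python's tarfile parser, which verifies the checksum.
block = xhdr_block(b"paxheader.0", 11, b"x")
info = tarfile.TarInfo.frombuf(block, "utf-8", "strict")
print(info.name, oct(info.mode), info.size, info.type)
```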

> +When encoding the data for an extended header, all entries are sorted in order
> +by the byte values of their keys as encoded in UTF-8.  Duplicate keys are not
> +permitted.
> +
> +Because the format allows multiple length encodings of some values, the shortest
> +possible encoding must always be used.

Ævar Arnfjörð Bjarmason Feb. 6, 2023, 10:18 p.m. UTC | #2
On Sun, Feb 05 2023, brian m. carlson wrote:

> +The goals for this format are that it is first and foremost reproducible, that
> +identical trees produce identical results, that it is simple and easy to
> +implement correctly, and that it is useful in general.  While we don't consider
> +functionality needs beyond Git's at the moment (such as hardlinks, xattrs, or
> +sparse files), there is intense interest in reproducible builds, and so it makes
> +sense to design something that can see general use for software interchange.

I think a goal should be to be bit-for-bit compatible with what we've
had historically, which...

> +Object IDs are not included in this version of the format because this produces
> +non-identical data when identical data is serialized with different hash
> +algorithms.

...this is inherently at odds with. I had a longer comment about why I
think we can have our cake & eat it too at
https://lore.kernel.org/git/230131.86tu06rkbp.gmgdl@evledraar.gmail.com/

Maybe there are other changes in the proposed spec that put it at odds
with such a goal, it's unclear to me if this is the only difference.

But I don't see why we need bit-for-bit compatible output between SHA-1
and SHA-256 git repos for the reasons noted in the linked-to reply, and
removing this will remove a *really useful* aspect of our tar format,
which is that you can grab an arbitrary tarball, and see what commit
it's produced from.

Even if you want to retain SHA-1 and SHA-256 interop as far as tar is
concerned, an un-discussed alternative is to just stick the SHA-1 OID
into the SHA-256 archive.

For repos that are migrated we envision having such a bi-directional
mapping anyway.

And for those that started out as SHA-256, or where we no longer care
about compatibility with old SHA-1, we can just start including the
SHA-256 OID, as all compatibility concerns have gone away when we
stopped bothering to maintain the mapping, no?

> +|===
> +| Field Name | Value
> +
> +| `name`     | the last path component if it fits; otherwise, `path.%d`
> +| `mode`     | `0640` (regular file), `0777` (symbolic link), `0750` (directory)
> +| `uid`      | `0`
> +| `gid`      | `0`
> +| `size`     | the size of the data in bytes for regular files if it fits; otherwise, `0`
> +| `mtime`    | `0` (the Epoch)
> +| `chksum`   | as specified in the standard

This is the nth reference to "the standard". I think this would be
improved by linking to it, isn't it
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html ?

brian m. carlson Feb. 7, 2023, 10:34 p.m. UTC | #3
On 2023-02-06 at 21:08:59, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> "is identical to what"?  Ditto for the one in the previous
> paragraph.  The first paragraph is better in that there is "between
> versions", even though it would be easier to grok if we made it more
> clear that we are talking about versions of the software that is
> used to create the archive, not the version of contents being
> archived.
> 
> Our goal is that serializing the same tree object or the same commit
> object yields a bit-for-bit identical result, no matter which
> version of Git is used, and no matter what platform the Git used to
> create the archive was built on.  Mentioning in the description both
> what we take an archive out of (i.e. tree or commit) and that we can
> use different versions of Git to create archives would make it
> easier to grok.

I can update that to reflect things more accurately.

> > +Goals and Rationale
> > +-------------------
> > +
> > +The goals for this format are that it is first and foremost reproducible, that
> > +identical trees produce identical results, that it is simple and easy to
> > +implement correctly, and that it is useful in general.  While we don't consider
> > +functionality needs beyond Git's at the moment (such as hardlinks, xattrs, or
> > +sparse files), there is intense interest in reproducible builds, and so it makes
> > +sense to design something that can see general use for software interchange.
> 
> Perfect.
> 
> > +Because the goal is strict reproducibility, this format doesn't honor
> > +`tar.umask` or other options that can produce different output.  It serializes
> > +all timestamps as the Epoch, which produces identical results whether the tree
> > +is serialized as a tree, commit, or tag.  This is consistent with the behaviour
> > +of some other tar serializers, including the default for modern Rust crates, and
> > +is not believed to pose any interoperability problems.
> 
> > +Object IDs are not included in this version of the format because this produces
> > +non-identical data when identical data is serialized with different hash
> > +algorithms.
> 
> Declaring that we'll always peel a tag or a commit down to a tree is
> one sure way to avoid having to worry about object name hashes, but
> aren't we discarding too much utility by doing so?
> 
> This is probably debatable.  The commit object name embedded in the
> extended header of an archive makes it trivial to identify what
> version the archive _claims_ to have been taken from (you could also
> embed it in the filename that stores archive, but the use of the
> embedded metainfo makes it more robust against file names).  And
> running "git archive" twice, with different versions of Git on
> different architectures, should be reproducible as long as both
> invokers expressed their desire to see the commit object name in the
> archive by passing the commit, not its tree, to the command, and
> they are using the same hash algorithm.

It's true that it makes it easy to look up, but I can say I've never
used that functionality.  I think very few people actually know it
exists.

> Having said all that, I think stripping the commit object name (or
> tags) is a better design.  Imagine that I see I created a tarball
> earlier and published its hash, but later lost the tarball.  By not
> allowing any commit object name in the archive, it would force me to
> somehow name the tarball in such a way that I can tell which commit
> I used to create it, e.g. "git-e83c516331.tar".  Other people can
> notice the filename and without having seen the bytes in it, they
> can try running "git archive e83c516331" in their repository and see
> the output matches the hash I published earlier.  Having a commit or
> tag embedded in the archive would make it harder to do this kind of
> thing.

Most people do this anyway (except with a tag name), so I don't think
it's a big deal to have this as the primary mechanism.

> By the way, other potentially interesting points are:
> 
>  - Do we want to ignore "export-subst" for stability?

I think that would be a good idea.  I'll add it in v2.

>  - "git archive" can be invoked with pathspec to archive only a
>    subset of paths.

True.  I don't think that's a problem as long as we generate paths
correctly.  I'll be sure to add tests for it, though.

> > +Introduction to the Underlying Format
> > +-------------------------------------
> > ...
> > +A global extended header sets metadata for the entire file, and a per-file
> > +extended header applies to only the to which it corresponds.  A per-file
> 
> "only the to which" -> "only the file to which"

Will fix.

> > +While pax extensions are widely supported by most modern versions of tar
> > +(including versions on Windows and all major open-source OSes), some older
> > +archivers and non-tar implementations which do not understand them typically
> > +extract the extended headers as regular files.  Thus, it's helpful to have these
> > +entries have reasonable permissions and unique names.
> 
> Surely, and to make things reproducible, they shouldn't just be
> reasonable and unique.  They should be exactly as we define in the
> specification.

Yes, of course.  This is more to indicate why we've made the decisions
to name them as they are and give them the permissions we did.

> > +Every file serialized in the archive is serialized in lexicographical order by
> > +its bytes.  A directory is always serialized before its contents, and a
> 
> "by its bytes" -> "by the bytes in its filename" or something?
> Surely we do not sort by contents ;-)

Good point.  We should avoid ambiguity.

> > +directory is never serialized with a trailing slash.  If a system uses a Unicode
> > +encoding other than UTF-8, it encodes filenames as UTF-8.
> 
> This is a bit hard to grok.  Do you mean there may be a UTF-16 system
> where the paths in our tree objects are recorded in UTF-8, but
> "git checkout" of the tree may result in files named in the native
> encoding of that system, i.e. UTF-16 not UTF-8?  And even on such a
> system, running "git archive" would record paths in the archive in
> UTF-8 (i.e. the same as what was in the tree object)?  Or do you
> mean something stronger, like on a Latin-1 system with a Latin-1
> project that used Latin-1 pathnames even in the tree objects,
> when "git archive" produces an archive, the paths in it shall be
> transcoded from the original Latin-1 pathnames to UTF-8?

This means if, on Windows, someone uses --add-file or
--add-virtual-file, those paths will be encoded in UTF-8, not UTF-16.

> > +Version Number
> > +--------------
> > +
> > +The version number for this version is `ctar-v1`.
> > +
> > +Extended Headers
> > +----------------
> > +
> > +Global Extended Header
> > +~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +The global extended header (record `g`) shall contain one header:
> > +`CTAR.version`, which contains the version number specified above.
> > +
> > +The contents of the ustar header for the global extended header are as below,
> > +except that the `name` field contains `pax_global_header`.
> 
> "as below" meaning...?  The same as what is listed in "Per-File
> Extended Header"?  There is no `name` field listed there, though.

I'll make a clearer reference.

> > +Per-File Extended Header
> > +~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Each file has a per-file extended header.
> > +
> > +The following per-file extended header fields are included:
> > +
> > +|===
> > +| Field Name   | When Present  | Value
> > +
> > +| `atime`      | always        | `0`
> > +| `mtime`      | always        | `0`
> > +| `size`       | always        | size of the data in bytes
> > +| `path`       | always        | full path name of the file
> 
> These are length-prefixed data, so we do not have to worry about
> overly long pathnames or symlinks?

Correct.  This data can be arbitrarily long, provided the size of the
extended header can still be encoded in its ustar header, so the limit
is several gigabytes or so.  I don't think anybody thinks of that as a
practical limitation on filenames or other metadata.

> "we because" -> "because"

Will fix.

> > +we avoid explicitly declaring them as such and rely on the default archiver
> > +behavior, which may be more sensible.
> 
> So, do we or do we not store hdrcharset?  The producing Git does not
> know whether the pathnames stored in the tree it is asked to archive
> are in UTF-8, so it assumes everything is in UTF-8 and hence
> does not see the need to add hdrcharset?

pax says that these values are UTF-8 if not specified.  If they're
clearly not UTF-8, we use `hdrcharset` and say they're binary.  If they
look like valid UTF-8, we don't use `hdrcharset` and pretend they are in
fact UTF-8, in case somebody just likes causing discord by using
Windows-1252 that looks like UTF-8.
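
Roughly, in Python terms, the decision could look like the sketch below.  The
function names are made up for illustration, and Python's strict UTF-8 decoder
stands in for whatever validity check an implementation would actually use.

```python
from typing import Optional


def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def needs_binary_hdrcharset(path: bytes,
                            linkpath: Optional[bytes] = None) -> bool:
    """Return True if the `hdrcharset=BINARY` record should be emitted."""
    # Only `path` and, when present, `linkpath` are considered.
    candidates = [path] if linkpath is None else [path, linkpath]
    return not all(is_valid_utf8(c) for c in candidates)


print(needs_binary_hdrcharset(b"caf\xc3\xa9.txt"))  # False: valid UTF-8
print(needs_binary_hdrcharset(b"caf\xe9.txt"))      # True: Latin-1 byte
```

Bytes that happen to decode as UTF-8 are treated as UTF-8 regardless of what
the author intended, which is the "pretend" case above.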

> In other words, we just store the contents of the blob that
> represents the symbolic link there?  I wonder if we do anything
> special if a blob, that is pointed at in an entry in a tree whose
> mode bits are 120000, has NUL in it (should we teach fsck to flag
> it, for example)?

This is the destination of the symlink, yes.  We can simply check for
NUL and abort; I don't think that's an unreasonable behaviour in any
case.

> The order of entries needs to be specified when we aim for
> bit-for-bit reproducibility, no?

Yes.  That's specified in the next section, where we say this:

  When encoding the data for an extended header, all entries are sorted in order
  by the byte values of their keys as encoded in UTF-8.  Duplicate keys are not
  permitted.

I'll make a reference to that section and describe it more clearly.
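
To illustrate, a small Python sketch of serializing the always-present
per-file fields in key order.  The helper names are hypothetical, and the
length calculation follows the "shortest possible encoding" rule from the
draft.

```python
def pax_record(key: bytes, value: bytes) -> bytes:
    # b"LEN key=value\n", where LEN counts the whole record including
    # its own digits; the smallest digit count that works is used.
    body = b" " + key + b"=" + value + b"\n"
    digits = 1
    while len(str(len(body) + digits)) != digits:
        digits += 1
    return str(len(body) + digits).encode("ascii") + body


def per_file_xhdr(path: bytes, size: int) -> bytes:
    # A dict cannot hold duplicate keys, which matches the
    # "duplicate keys are not permitted" rule.
    fields = {
        b"atime": b"0",
        b"mtime": b"0",
        b"size": str(size).encode("ascii"),
        b"path": path,
        b"uid": b"0",
        b"gid": b"0",
        b"uname": b"root",
        b"gname": b"root",
    }
    # Records are emitted in order of the byte values of their keys.
    return b"".join(pax_record(k, fields[k]) for k in sorted(fields))


print(per_file_xhdr(b"README", 5).decode("ascii"), end="")
```

For these keys the resulting order is atime, gid, gname, mtime, path, size,
uid, uname.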

> "the header block" -> "the ustar header block" to match the next
> section, probably.

I'll update that.

> These are bare-bones header fields, not extended headers.  Do we want
> to refer to some canonical sources so that readers understand that,
> unlike the extended headers, we are talking about fixed-length fields?
> The description above talks about "padding", but that of course
> applies to fixed-width columns.

Correct.  I'll mention that these are the values in the ustar header for
the extended header.  I'll also put some references in to the
documentation.

brian m. carlson Feb. 7, 2023, 11:01 p.m. UTC | #4
On 2023-02-06 at 22:18:47, Ævar Arnfjörð Bjarmason wrote:
> Maybe there are other changes in the proposed spec that put it at odds
> with such a goal, it's unclear to me if this is the only difference.

As mentioned in the description, that doesn't address trees, which have
never been consistent traditionally.  We also have bad permissions for
pax headers (always 666), which is something we've tried to fix before
and is not something we want to carry on with.

You specifically sent a patch stating that we're not guaranteeing that
format, and I agree with that assessment.  I'm proposing a format that
we would guarantee and which does not have any of the historical baggage
or warts that we don't want to keep.

This format also doesn't serialize timestamps; everything is at the
Epoch.  Again, that's because serializing a commit and its tree or even
a tag and its commit would produce different results.

> But I don't see why we need bit-for-bit compatible output between SHA-1
> and SHA-256 git repos for the reasons noted in the linked-to reply, and
> removing this will remove a *really useful* aspect of our tar format,
> which is that you can grab an arbitrary tarball, and see what commit
> it's produced from.

True, but this is a highly obscure feature and I've never used it
outside of testing.  If you want it, you can have it: you just want the
default format, which serializes it in the header, and not the extremely
restricted format I'm proposing here which is designed to never ever
change.  We might well decide to add cool new features and useful
information to the default format, but this one will be fixed forever.

> Even if you want to retain SHA-1 and SHA-256 interop as far as tar is
> concerned, an un-discussed alternative is to just stick the SHA-1 OID
> into the SHA-256 archive.
> 
> For repos that are migrated we envision having such a bi-directional
> mapping anyway.
> 
> And for those that started out as SHA-256, or where we no longer care
> about compatibility with old SHA-1, we can just start including the
> SHA-256 OID, as all compatibility concerns have gone away when we
> stopped bothering to maintain the mapping, no?

Whether SHA-1 or SHA-256 or both are present in the repo is a local
decision.  The transition plan specifically anticipates people either
preferring one hash or the other in output.  The behaviour is not "use
SHA-1 if there's SHA-1 and use SHA-256 otherwise", because even if
everyone has SHA-256 and prefers it on their system, some people may
still have SHA-1 for historical reasons and that would lead to different
output.

Part of this is because I anticipate that once the interop work is done,
GitHub may transition repositories on the server to SHA-256 with SHA-1
interop for existing SHA-1 repositories.  People are still going to have
a fit if tarball data breaks at some point because the repository owner
decided to flip the default hash algorithm, and I'm specifically
proposing a format that is not going to direct hordes of angry users in
my direction or the repository owner's in that case.  Lots of people are
going to avoid switching the default hash algorithm if it breaks
tarballs, and I specifically don't want to encourage people sticking
with SHA-1 for that reason.

> This is the nth reference to "the standard". I think this would be
> improved by linking to it, isn't it
> https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html ?

Yeah, I'll do that.
Ævar Arnfjörð Bjarmason Feb. 8, 2023, 11:07 a.m. UTC | #5
On Tue, Feb 07 2023, brian m. carlson wrote:

> On 2023-02-06 at 22:18:47, Ævar Arnfjörð Bjarmason wrote:
>> Maybe there are other changes in the proposed spec that put it at odds
>> with such a goal, it's unclear to me if this is the only difference.
>
> As mentioned in the description, that doesn't address trees, which have
> never been consistent traditionally.

You mention "[...]it produces identical results for identical trees,
regardless of hash algorithm". I'm not familiar with how we encode trees
differently based on the hash algorithm. Do we stick the tree OID in
there somewhere, or is it something else?

IOW do these trees vary within the same hash algorithm, or is it another
special-case where we now produce a different tarball with SHA-1 and
SHA-256 with commits, but also with trees?

B.t.w. are there some options to tar(1) to make it dump these headers
you're describing? I couldn't find anything when looking; it looks like
libtar might support it, but I was hoping for something more compatible
with my laziness :)

> We also have bad permissions for pax headers (always 666), which is
> something we've tried to fix before and is not something we want to
> carry on with.

I'm concerned that you're expanding the scope of a "stable" tar format
to necessarily include one-off fixing various things we've regretted
over the years.

Maybe that *needs* to happen, but so far I don't see why, you've
described:

* We include the OID in the metadata
* Something like that, but for trees?
* The sucky 0666 permissions we'd like to fix.
* We don't serialize timestamps (which is now optional, depending on how
  you invoke it)

I really applaud your efforts here, but if that's the extent of the
changes, I don't see why the v1 and default format shouldn't be
something that produces identical results to "git archive" as it stands
today.

Then a v1/v2 is just this pseudocode, isn't it?

 	switch (version) {
	case 1:
		break; /* warts and all */
	case 2:
		include_oid = 0;
		satanic_permissions = 0;
		no_timestamps = 1;
		break;
	}

The reason we haven't promised to support an archive format isn't
because we didn't find every aspect of it aesthetically pleasing, but
because we didn't want to commit to some bug-for-bug compatibility with
whatever the code is doing right now.

Now that you've done the work to specify it, it turns out that a
proposed format you'd like going forward is almost identical to what we
currently emit, to the point that supporting that as a v1 seems rather
trivial (but again, I may still be missing something).

We have a huge long-tail of users in the wild, forcing those users to go
through a one-time breakage of their existing archives if we could avoid
that by making v1 the current format seems entirely unnecessary.

I totally see your point about wanting byte-for-byte the same archives
out of the SHA-1 and SHA-256 version of the same repo, I think that's a
good goal, and it's also a good goal to get rid of these other warts.

But I don't see why it needs to be required, or even the default.

> You specifically sent a patch stating that we're not guaranteeing that
> format, and I agree with that assessment.  I'm proposing a format that
> we would guarantee and which does not have any of the historical baggage
> or warts that we don't want to keep.

Per the above I just don't see why that's a criterion. I think we should
be weighing the benefits of changing the existing default "git archive"
output vs. the cost of maintaining the delta to some v2 wart-less
format.

> This format also doesn't serialize timestamps; everything is at the
> Epoch.  Again, that's because serializing a commit and its tree or even
> a tag and its commit would produce different results.

This seems like further scope creep, and in this particular case I don't
see how *always* doing that helps you with reproducibility.

I.e. for the cases where we're now given a top-level tree it's obvious
how this helps, we encode the time(), so every time it's different.

But in the case where we get a commit (or tag) ID we use the timestamp
in the commit (or tag?) envelope.

When producing a release archive, or packing up a given commit that's
therefore going to be stable, even between SHA-1 and SHA-256, although
those two would differ if the OID is put in the header, but that's
another matter.

If I understand you correctly here you seem to be in pursuit of another
goal entirely, which is that you'd like the same output for different
commits if they're TREESAME.

Or, if you have a bunch of release archives a very nice attribute of
this is that with a bunch of similar archives on the same FS you could
e.g. benefit more from block-level deduplication.

All of which is cool, but I don't see why it needs to be a hard
requirement in the design.

>> But I don't see why we need bit-for-bit compatible output between SHA-1
>> and SHA-256 git repos for the reasons noted in the linked-to reply, and
>> removing this will remove a *really useful* aspect of our tar format,
>> which is that you can grab an arbitrary tarball, and see what commit
>> it's produced from.
>
> True, but this is a highly obscure feature and I've never used it
> outside of testing.

I admit that's a bit obscure, but one of those things that really comes
in handy when you need it, I vaguely remember using it once or twice and
being very happy it was there.

But related to that is setting everything to epoch:0, doesn't that mean
that when you unpack say a release archive that in common filesystem
browsers all of the files will be dated in the 70s, as opposed to the
time of release as it is now?

> If you want it, you can have it: you just want the
> default format, which serializes it in the header, and not the extremely
> restricted format I'm proposing here which is designed to never ever
> change.

Okay, so I might have to take back much of what I said above, so you're
not opposed to supporting the current format as a "v1" or whatever,
you'd just like this proposed "v2" (or "vstable", or whatever) to have
some "blessed" status.

I just don't get why we wouldn't support both, if the delta is as small
as seems to be the case. If that's right this "v2" is less "extremely
restricted" compared to our current "v1", and more "almost identical",
just "a bit less wart-y".

> We might well decide to add cool new features and useful
> information to the default format, but this one will be fixed forever.

I just don't see the target audience for that. As the issues that
prompted these on-list discussions show we have people in the wild who
deeply care about the current format.

They probably care enough about that that we're likely to try to support
that forever, at least I don't see any currently proposed change to the
format that seems worth breaking things for those users.

So, if that's the case we'd have a v1 (current), this "vstable" (never
changes), and a v2 (v1 + extra neat things), etc.

Then we'd be maintaining 3 formats instead of 2 formats (a "v1" and
"vunstable").

>> Even if you want to retain SHA-1 and SHA-256 interop as far as tar is
>> concerned, an un-discussed alternative is to just stick the SHA-1 OID
>> into the SHA-256 archive.
>> 
>> For repos that are migrated we envision having such a bi-directional
>> mapping anyway.
>> 
>> And for those that started out as SHA-256, or where we no longer care
>> about compatibility with old SHA-1, we can just start including the
>> SHA-256 OID, as all compatibility concerns have gone away when we
>> stopped bothering to maintain the mapping, no?
>
> Whether SHA-1 or SHA-256 or both are present in the repo is a local
> decision.  The transition plan specifically anticipates people either
> preferring one hash or the other in output.  The behaviour is not "use
> SHA-1 if there's SHA-1 and use SHA-256 otherwise", because even if
> everyone has SHA-256 and prefers it on their system, some people may
> still have SHA-1 for historical reasons and that would lead to different
> output.

Yes, but who has this issue in practice? In practice people are
producing archives as part of some release process.

As long as they keep using SHA-1 to make release they're fine, at some
point they'll switch over to SHA-256 by default, and then their new
releases will use SHA-256.

If they then have to for some reason go back to an old commit when SHA-1
was the default it might be a tiny hassle, but no more than doing the
same if the format had changed entirely.

> Part of this is because I anticipate that once the interop work is done,
> GitHub may transition repositories on the server to SHA-256 with SHA-1
> interop for existing SHA-1 repositories.  People are still going to have
> a fit if tarball data breaks at some point because the repository owner
> decided to flip the default hash algorithm, and I'm specifically
> proposing a format that is not going to direct hordes of angry users in
> my direction or the repository owner's in that case.  Lots of people are
> going to avoid switching the default hash algorithm if it breaks
> tarballs, and I specifically don't want to encourage people sticking
> with SHA-1 for that reason.

I see that, but I don't see how your plan isn't a perfect recipe for
creating the problem you're trying to avoid.

You have tarballs generated with the current format today, 3rd party
systems are dynamically downloading e.g. v1.0.0.tar.gz or whatever, and
expecting it to byte-for-byte match previous downloads.

If you're going to switch to some stable format surely that would either
need to involve massive one-off breakage, or you'd have some "flag day",
from today all new archives are produced with the new "stable" method.

If that "stable" format is different (among other things, but the others
seem equally trivial) because you wanted to extract the OID from the
format for SHA-1 and SHA-256 interop, why can't the day the repository
switched to SHA-256 be that flag day?
brian m. carlson Feb. 8, 2023, 11:52 p.m. UTC | #6
On 2023-02-08 at 11:07:44, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Feb 07 2023, brian m. carlson wrote:
> 
> > On 2023-02-06 at 22:18:47, Ævar Arnfjörð Bjarmason wrote:
> >> Maybe there are other changes in the proposed spec that put it at odds
> >> with such a goal, it's unclear to me if this is the only difference.
> >
> > As mentioned in the description, that doesn't address trees, which have
> > never been consistent traditionally.
> 
> You mention "[...]it produces identical results for identical trees,
> regardless of hash algorithm". I'm not familiar with how we encode trees
> differently based on the hash algorithm. Do we stick the tree OID in
> there somewhere, or is it something else?

If you pass a commit or tag on the command line, you get the timestamp
of the commit or tag.  If you pass a tree, you get the current
timestamp.  Thus, whether the output is reproducible depends on the type
of object you specify.

> IOW do these trees vary within the same hash algorithm, or is it another
> special-case where we now produce a different tarball with SHA-1 and
> SHA-256 with commits, but also with trees?

When we write an archive, we embed a comment with the commit object ID
(see next response).  That's using the hash algorithm in the repository.
If we write an archive for a tree, no object ID is embedded.

> B.t.w. are there some options to tar(1) to make it dump these headers
> you're describing? I couldn't find anything when looking; it looks like
> libtar might support it, but I was hoping for something more compatible
> with my laziness :)

I don't think so.  However, you can see them with `git archive --format=tar HEAD | env -u LESSOPEN less -R`.

The body of the global header looks like this (my indentation):

  52 comment=7ff60001dae72ac39783ca536a4b673862b28587

If you want to see what GNU tar produces, you can run `tar -cf - --posix --exclude .git . | env -u LESSOPEN less -R`:

  30 mtime=1675633909.844009705
  30 atime=1675895555.716075364
  30 ctime=1675633909.844009705

> I'm concerned that you're expanding the scope of a "stable" tar format
> to necessarily include one-off fixing various things we've regretted
> over the years.

Well, yes, because if we're specifying a stable format, we should make it
something we want to support long term.  Right now, we don't guarantee
anything; if we find something unsatisfactory, we just fix it.

> Then a v1/v2 is just this pseudocode, isn't it?
> 
>  	switch (version) {
> 	case 1:
> 		break; /* warts and all */
> 	case 2:
> 		include_oid = 0;
> 		satanic_permissions = 0;
> 		no_timestamps = 1;
> 		break;
> 	}

As I mentioned in the doc, there are multiple ways to encode various
things like lengths and the order of headers.  It's not immediately
obvious from the code how our length encoding works, and that's the kind
of code that could easily have a small refactor or bug fix break things
really badly.

Additionally, we, like most other pax implementations, just encode
headers in whatever order we thought was most expedient when
implementing, and sometimes they're emitted and sometimes they're not.
That's a really great recipe for behaviour that is extremely hard to
test and extremely hard to reproduce.  For example, we'd have to test
the interaction with long paths and symlinks, long paths and large
files, and several other sets of variants to make sure that a minor
refactor doesn't change output.  The current logic of the code is very
subtle.
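
For illustration only, one way to take emission order out of the
picture is to collect the extended header records for an entry first
and then sort them by keyword before writing.  This is a minimal C
sketch under that assumption; `struct pax_record` and
`pax_sort_records()` are hypothetical names, and bytewise keyword order
is just one possible canonical order, not necessarily the one the
proposed format specifies.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of one pax extended header record. */
struct pax_record {
	const char *key;
	const char *value;
};

/* Compare two records by keyword, bytewise. */
static int pax_record_cmp(const void *a, const void *b)
{
	const struct pax_record *ra = a;
	const struct pax_record *rb = b;

	return strcmp(ra->key, rb->key);
}

/*
 * Put the records into one fixed order, regardless of the order in
 * which the caller happened to generate them.  Keywords are assumed
 * to be unique within an entry, so qsort() not being stable does not
 * matter here.
 */
void pax_sort_records(struct pax_record *recs, size_t nr)
{
	qsort(recs, nr, sizeof(*recs), pax_record_cmp);
}
```

With something like this, an entry that needs both `path` and
`linkpath` records produces them in the same order no matter which code
path noticed the long name first.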

> Now that you've done the work to specify it, it turns out that a
> proposed format you'd like going forward is almost identical to what we
> currently emit, to the point that supporting that as a v1 seems rather
> trivial (but again, I may still be missing something).

It's relatively similar.  The format I'm proposing is much stricter and
more regular than what we do now.

I'm thinking that the changes will be limited to writing three or four
functions.  It's not terribly invasive, but there will be some departure
from the existing code.

> We have a huge long-tail of users in the wild, forcing those users to go
> through a one-time breakage of their existing archives if we could avoid
> that by making v1 the current format seems entirely unnecessary.

Because right now, the current code is not amenable to producing or
testing reproducible output.  Any significant refactor of the existing
code will result in an output change unless the author is extremely
careful, and I'm not comfortable guaranteeing the current format with
that caveat.  The reason the data hasn't changed is because such a
refactor hasn't happened yet.

I'm specifically thinking about the length calculation in
`strbuf_append_ext_header`, which is extremely magical, and the path
splitting in `get_path_prefix`.  Those are both extremely subtle and
logical places to perform a refactor or adjustment that might change
output in a very minor way for a tiny subset of files.
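
To make the subtlety concrete: each pax extended header record has the
form "LEN KEY=VALUE\n", where LEN is the decimal length of the whole
record, including the digits of LEN itself.  A minimal sketch in C of
that self-referential calculation follows; `pax_record_len()` and
`pax_format_record()` are hypothetical helpers written for this
illustration, not the code in archive-tar.c.

```c
#include <stdio.h>
#include <string.h>

/*
 * Total length of the record "LEN KEY=VALUE\n".  LEN counts every
 * byte of the record, so adding a digit to LEN can itself push the
 * total over the next power of ten.
 */
size_t pax_record_len(const char *key, size_t value_len)
{
	/* space + key + '=' + value + newline */
	size_t body = 1 + strlen(key) + 1 + value_len + 1;
	size_t digits = 1;
	size_t limit = 10;

	/* Grow the digit count until body + digits fits in that many digits. */
	while (body + digits >= limit) {
		digits++;
		limit *= 10;
	}
	return body + digits;
}

/*
 * Format one record into out.  Returns the record length, or -1 if
 * out is too small.  The value is assumed to be a NUL-terminated
 * string for the purposes of this sketch.
 */
long pax_format_record(char *out, size_t out_size,
		       const char *key, const char *value)
{
	size_t len = pax_record_len(key, strlen(value));

	if (len + 1 > out_size)
		return -1;
	snprintf(out, out_size, "%zu %s=%s\n", len, key, value);
	return (long)len;
}
```

With this sketch, a 40-character object ID under the `comment` keyword
gives a record length of 52, and a 20-character `mtime` value gives 30,
matching the header bodies shown earlier in this message.  The edge
cases sit at the digit boundaries: a 97-byte body yields 99, but a
98-byte body yields 101.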

> When producing a release archive, or packing up a given commit that's
> therefore going to be stable, even between SHA-1 and SHA-256, although
> those two would differ if the OID is put in the header, but that's
> another matter.
> 
> If I understand you correctly here you seem to be in pursuit of another
> goal entirely, which is that you'd like the same output for different
> commits if they're TREESAME.
> 
> Or, if you have a bunch of release archives a very nice attribute of
> this is that with a bunch of similar archives on the same FS you could
> e.g. benefit more from block-level deduplication.
> 
> All of which is cool, but I don't see why it needs to be a hard
> requirement in the design.

I think it's valuable to have the same input data produce the same
output.  That means that I can use Git to produce the archive, or some
other tool implementing the same format, and it just works.  If GNU tar,
libgit2, or libarchive implemented the same format with an option,
people would also be able to produce an identical archive as long as
they excluded the files in `.gitignore` and `.git`.  That approach is
very valuable if you need to slightly modify the contents of the archive
that Git produced in a way not supported by --add-file (and Junio used
to do that himself for Git releases before --add-file).

> But related to that is setting everything to epoch:0, doesn't that mean
> that when you unpack say a release archive that in common filesystem
> browsers all of the files will be dated in the 70s, as opposed to the
> time of release as it is now?

Yes.  That's also the case for current Rust crates and lots of other
reproducible archives.  I've heard exactly zero complaints about that
behaviour since I implemented it in Cargo.  Looking back at the history,
apparently there's some broken behaviour with the actual Epoch and lldb
(because 0 is a sentinel), but the change is just to switch to a
timestamp of 1 instead of 0, which I can do in the next version of my
patch.  No other problems seem to have come up with using a fixed
timestamp.
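
As a rough illustration of what a fixed timestamp means at the byte
level: the ustar header stores mtime as an octal number in a 12-byte
field.  The C sketch below formats a constant into such a field;
`FIXED_MTIME` and `ustar_format_mtime()` are hypothetical names, and the
value 1 reflects the change suggested above rather than anything
already implemented.

```c
#include <stdio.h>

#define USTAR_MTIME_FIELD 12	/* size of the mtime field in a ustar header */
#define FIXED_MTIME 1UL		/* hypothetical constant: 1 rather than 0 */

/*
 * Write ts into field as 11 zero-padded octal digits followed by a
 * NUL.  Returns 0 on success, or -1 if ts needs more than 11 octal
 * digits and so does not fit.
 */
int ustar_format_mtime(char field[USTAR_MTIME_FIELD], unsigned long ts)
{
	int n = snprintf(field, USTAR_MTIME_FIELD, "%011lo", ts);

	return (n == USTAR_MTIME_FIELD - 1) ? 0 : -1;
}
```

Calling `ustar_format_mtime(field, FIXED_MTIME)` leaves the same eleven
octal digits in every entry, which is what makes the field independent
of the commit, tag, or tree the archive was made from.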

The only place where I could imagine this being a problem is if you used
Make in a directory after unpacking a new archive over the old one, but
that is a terrible idea in the first place since that leaves now-removed
files from the old version behind which will probably cause your build
to fail at some point.  In any event, because almost everyone uses
`--prefix` with the version number for their archives, it's difficult to
even perform that extraction over top anyway, and so it's unlikely that
anyone actually does such a thing.

Otherwise, there's typically no functional difference.

> Okay, so I might have to take back much of what I said above, so you're
> not opposed to supporting the current format as a "v1" or whatever,
> you'd just like this proposed "v2" (or "vstable", or whatever) to have
> some "blessed" status.

No, I'm not opposed to supporting both.  There's "default" (v0 if you
like) and "v1".  If you say, "I'd like a tarball", you get what we
produce now (or what it changes to in the future).  If you say, "I want
bit-for-bit compatibility", then you get v1.

> I just don't get why we wouldn't support both, if the delta is as small
> as seems to be the case. If that's right this "v2" is less "extremely
> restricted" compared to our current "v1", and more "almost identical",
> just "a bit less wart-y".

Right, I think it's very easy to do.

> I just don't see the target audience for that. As the issues that
> prompted these on-list discussions show we have people in the wild who
> deeply care about the current format.
> 
> They probably care enough about that that we're likely to try to support
> that forever, at least I don't see any currently proposed change to the
> format that seems worth breaking things for those users.

I don't think there's any purpose in guaranteeing the current format,
given what I've said above about testability and the risk of breakage
during a refactor with the current code, and I don't think the project
should do that.  However, downstream users, including various forges, may
wish to do so, and if so I wish them all the best.

> If you're going to switch to some stable format surely that would either
> need to involve massive one-off breakage, or you'd have some "flag day",
> from today all new archives are produced with the new "stable" method.

Nope.  There's simply a new option to produce v1 archives and people
switch over as part of their normal build system maintenance, and
eventually nobody cares about the ancient versions depending on the old
format.
Ævar Arnfjörð Bjarmason Feb. 9, 2023, 12:35 a.m. UTC | #7
On Wed, Feb 08 2023, brian m. carlson wrote:

> On 2023-02-08 at 11:07:44, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Tue, Feb 07 2023, brian m. carlson wrote:
>> 
>> > On 2023-02-06 at 22:18:47, Ævar Arnfjörð Bjarmason wrote:
>> >> Maybe there are other changes in the proposed spec that put it at odds
>> >> with such a goal, it's unclear to me if this is the only difference.
>> >
>> > As mentioned in the description, that doesn't address trees, which have
>> > never been consistent traditionally.
>> 
>> You mention "[...]it produces identical results for identical trees,
>> regardless of hash algorithm". I'm not familiar with how we encode trees
>> differently based on the hash algorithm. Do we stick the tree OID in
>> there somewhere, or is it something else?
>
> If you pass a commit or tag on the command line, you get the timestamp
> of the commit or tag.  If you pass a tree, you get the current
> timestamp.  Thus, whether the output is reproducible depends on the type
> of object you specify.

Ah, that's all. So just the case documented in the opening paragraphs of
the DESCRIPTION of git-archive.

I think it's probably a good thing to enforce the epoch:0 for the cases
where the time varies now, or maybe we'd say "if you want stable
archives, point to a commit or tag", but that's ultimately a UX issue.

But regardless of any other points we may differ on I think if that's
what you meant this section of the doc is really confusing:

	It serializes all timestamps as the Epoch, which produces
	identical results whether the tree is serialized as a tree,
	commit, or tag.

Because we do produce identical results now for commit or tag, so what's
epoch:0 for those solving?

>> IOW do these trees vary within the same hash algorithm, or is it another
>> special-case where we now produce a different tarball with SHA-1 and
>> SHA-256 with commits, but also with trees?
>
> When we write an archive, we embed a comment with the commit object ID
> (see next response).  That's using the hash algorithm in the repository.
> If we write an archive for a tree, no object ID is embedded.

Yeah, it was answered in the preceding. I.e. there's no special "tree"
magic here beyond the long-standing dynamic epoch behavior for the
"tree" case, thanks.

>> B.t.w. are there some options to tar(1) to make it dump these headers
>> you're describing? I couldn't find anything when looking; it looks like
>> libtar might support it, but I was hoping for something more compatible
>> with my laziness :)
>
> I don't think so.  However, you can see them with `git archive --format=tar HEAD | env -u LESSOPEN less -R`.
>
> The body of the global header looks like this (my indentation):
>
>   52 comment=7ff60001dae72ac39783ca536a4b673862b28587
>
> If you want to see what GNU tar produces, you can run `tar -cf - --posix --exclude .git . | env -u LESSOPEN less -R`:
>
>   30 mtime=1675633909.844009705
>   30 atime=1675895555.716075364
>   30 ctime=1675633909.844009705

*nod*, some variant of that is how I've been looking at this,
thanks.

>> I'm concerned that you're expanding the scope of a "stable" tar format
>> to necessarily include one-off fixing various things we've regretted
>> over the years.
>
> Well, yes, because if we're specifying a stable format, we should make it
> something we want to support long term.  Right now, we don't guarantee
> anything; if we find something unsatisfactory, we just fix it.

I won't repeat myself too much here, but we disagree on that.

If it's e.g. just the 666 permissions or whatever I don't see the value
of changing that just for the sake of aesthetics. Or to satisfy some more
obscure use-cases (which must by definition be pretty obscure, as "git
archive" is in wide production use now with those warts).

But having said that...

>> Then a v1/v2 is just this pseudocode, isn't it?
>> 
>>  	switch (version) {
>> 	case 1:
>> 		break; /* warts and all */
>> 	case 2:
>> 		include_oid = 0;
>> 		satanic_permissions = 0;
>> 		no_timestamps = 1;
>> 		break;
>> 	}
>
> As I mentioned in the doc, there are multiple ways to encode various
> things like lengths and the order of headers.  It's not immediately
> obvious from the code how our length encoding works, and that's the kind
> of code that could easily have a small refactor or bug fix break things
> really badly.

...I'm willing to be convinced otherwise.

So yes, if e.g. we find that to implement your proposed format we'd need
to change some tricky existing archiving code from 500 current lines of
tricky code, to either 100 neat lines in the new "stable" format, or 600
if we keep both, I'd be much less inclined to argue for this.

But I also think your proposal isn't doing itself any favors by pointing
out things that are clearly trivial to support without changes as the
things we should be "fixing" for a new stable format, e.g. modes and
epochs.

> Additionally, we, like most other pax implementations, just encode
> headers in whatever order we thought was most expedient when
> implementing, and sometimes they're emitted and sometimes they're not.
> That's a really great recipe for behaviour that is extremely hard to
> test and extremely hard to reproduce.  For example, we'd have to test
> the interaction with long paths and symlinks, long paths and large
> files, and several other sets of variants to make sure that a minor
> refactor doesn't change output.  The current logic of the code is very
> subtle.

Why do we need to exhaustively specify or test something we're not
testing or specifying now? Wouldn't something like this in some select
places in archive-tar.c be sufficient (the write_tar_entry2() being some
new format):

	diff --git a/archive-tar.c b/archive-tar.c
	index f8fad2946ef..4f8ca02a82a 100644
	--- a/archive-tar.c
	+++ b/archive-tar.c
	@@ -257,8 +257,17 @@ static int write_tar_entry(struct archiver_args *args,
	 	unsigned long size_in_header;
	 	int err = 0;
	 
	+	switch (args->tar_version) {
	+	case 1: break;
	+	case 2: return write_tar_entry2(...);
	+	}
	+
	 	memset(&header, 0, sizeof(header));
	 
	+	/*
	+	 * This logic implements the tar v1 format, don't ever change
	+	 * this, change write_tar_entry{2,3,...}() etc instead.
	+	 */
	 	if (S_ISDIR(mode) || S_ISGITLINK(mode)) {
	 		*header.typeflag = TYPEFLAG_DIR;
	 		mode = (mode | 0777) & ~tar_umask;
	diff --git a/archive.h b/archive.h
	index 08bed3ed3af..a2f72bbea7e 100644
	--- a/archive.h
	+++ b/archive.h
	@@ -24,6 +24,7 @@ struct archiver_args {
	 	int compression_level;
	 	struct string_list extra_files;
	 	struct pretty_print_context *pretty_ctx;
	+	unsigned int tar_version;
	 };
	 
	 /* main api */

This is analogous to what we did for "git patch-id
[--unstable|--stable]", the latter is more specified, the former is
"we're keeping around whatever we did before".

But this is all under my ongoing assumption that the main target
audience for archive stability that cares about any of this is someone
with existing archives who'd like to not go through some one-off
migration if they can help it.

>> Now that you've done the work to specify it, it turns out that a
>> proposed format you'd like going forward is almost identical to what we
>> currently emit, to the point that supporting that as a v1 seems rather
>> trivial (but again, I may still be missing something).
>
> It's relatively similar.  The format I'm proposing is much stricter and
> more regular than what we do now.
>
> I'm thinking that the changes will be limited to writing three or four
> functions.  It's not terribly invasive, but there will be some departure
> from the existing code.

Right, all of this sounds like mostly changes to write_tar_entry(),
prepare_header() and write_extended_header() in archive-tar.c

>> We have a huge long-tail of users in the wild, forcing those users to go
>> through a one-time breakage of their existing archives if we could avoid
>> that by making v1 the current format seems entirely unnecessary.
>
> Because right now, the current code is not amenable to producing or
> testing reproducible output.  Any significant refactor of the existing
> code will result in an output change unless the author is extremely
> careful, and I'm not comfortable guaranteeing the current format with
> that caveat.  The reason the data hasn't changed is because such a
> refactor hasn't happened yet.
>
> I'm specifically thinking about the length calculation in
> `strbuf_append_ext_header`, which is extremely magical, and the path
> splitting in `get_path_prefix`.  Those are both extremely subtle and
> logical places to perform a refactor or adjustment that might change
> output in a very minor way for a tiny subset of files.

Yeah, this sort of thing is all stuff I'm very much willing to be sold
on, and I think if the proposal had led with some of these things...

...but on the other hand the combined length of those two functions is
around 30 lines, and if we just "freeze" it (along with the existing
header writing code) doesn't look like something that would be much of a
hassle to just keep around, but maybe it all adds up.

>> When producing a release archive, or packing up a given commit that's
>> therefore going to be stable, even between SHA-1 and SHA-256, although
>> those two would differ if the OID is put in the header, but that's
>> another matter.
>> 
>> If I understand you correctly here you seem to be in pursuit of another
>> goal entirely, which is that you'd like the same output for different
>> commits if they're TREESAME.
>> 
>> Or, if you have a bunch of release archives a very nice attribute of
>> this is that with a bunch of similar archives on the same FS you could
>> e.g. benefit more from block-level deduplication.
>> 
>> All of which is cool, but I don't see why it needs to be a hard
>> requirement in the design.
>
> I think it's valuable to have the same input data produce the same
> output.  That means that I can use Git to produce the archive, or some
> other tool implementing the same format, and it just works.  If GNU tar,
> libgit2, or libarchive implemented the same format with an option,
> people would also be able to produce an identical archive as long as
> they excluded the files in `.gitignore` and `.git`.  That approach is
> very valuable if you need to slightly modify the contents of the archive
> that Git produced in a way not supported by --add-file (and Junio used
> to do that himself for Git releases before --add-file).

They'd produce the same output if they just read the timestamp in the
commit envelope. GNU tar has a --mtime, so isn't this a matter of a
one-liner to "git show" that spews out the envelope time in a format it
groks?

>> But related to that is setting everything to epoch:0, doesn't that mean
>> that when you unpack say a release archive that in common filesystem
>> browsers all of the files will be dated in the 70s, as opposed to the
>> time of release as it is now?
>
> Yes.  That's also the case for current Rust crates and lots of other
> reproducible archives.  I've heard exactly zero complaints about that
> behaviour since I implemented it in Cargo.

I don't think someone unpacking a bunch of files from different
archives, doing "sort by date" in their file browser, and finding that
it's all from the 70s is something people would open bug reports about.

It's just a small fringe UX benefit, but it also seems like a small
matter to keep the non-zero epochs.

> Looking back at the history,
> apparently there's some broken behaviour with the actual Epoch and lldb
> (because 0 is a sentinel), but the change is just to switch to a
> timestamp of 1 instead of 0, which I can do in the next version of my
> patch.  No other problems seem to have come up with using a fixed
> timestamp.

Hah, I wonder if in 10 years some other system will discover that epoch
1 is used as a workaround for such issues, and hardcode that 1 is also a
meaningless sentinel value, so the next format will have epoch 2 etc.... :)

> The only place where I could imagine this being a problem is if you used
> Make in a directory after unpacking a new archive over the old one, but
> that is a terrible idea in the first place since that leaves now-removed
> files from the old version behind which will probably cause your build
> to fail at some point.

I hadn't thought of that, but that's another benefit to meaningful
timestamps.

I wouldn't call it a "terrible idea", it's a really well supported
pattern in "make", and this project with its messy Makefile infra even
supports it at least 95% of the time :)

(There's some messy picking up of old files logic with Documentation/*
in particular).

> in any event, because almost everyone uses
> `--prefix` with the version number for their archives, it's difficult to
> even perform that extraction over top anyway,[...]

There's going to be a lot of use cases on the edges, e.g. having
multiple prefix-unpacked archives, and wanting to use "find" with
"mtime" to run a meaningful search over them.

> and so it's unlikely that anyone actually does such a thing.

I think it's unlikely that we'll ever have more than a double-digit
number of people who'll implement any proposed archive format we come up
with, whereas the consumers will be a lot more digits :)

> Otherwise, there's typically no functional difference.

I think we're going to disagree on this point, but I think your proposal
would be improved if it clearly carved out this case, because it's
currently claiming to be in service of reproducible archives, which isn't
really the case...

> [...]
>> I just don't see the target audience for that. As the issues that
>> prompted these on-list discussions show we have people in the wild who
>> deeply care about the current format.
>> 
>> They probably care enough about that that we're likely to try to support
>> that forever, at least I don't see any currently proposed change to the
>> format that seems worth breaking things for those users.
>
> I don't think there's any purpose in guaranteeing the current format,
> given what I've said above about testability and the risk of breakage
> during a refactor with the current code, and I don't think the project
> should do that.  However, downstream users, including various forges, may
> wish to do so, and if so I wish them all the best.
>
>> If you're going to switch to some stable format surely that would either
>> need to involve massive one-off breakage, or you'd have some "flag day",
>> from today all new archives are produced with the new "stable" method.
>
> Nope.  There's simply a new option to produce v1 archives and people
> switch over as part of their normal build system maintenance, and
> eventually nobody cares about the ancient versions depending on the old
> format.

Or more likely they'll never make that switch, or even read the docs
about this. If they did that in the first place, we'd have had nobody
relying on this implementation detail, but people do.

And once we change the "unstable" format we've now given ourselves
license to change more freely, their systems break just as they did
before.

Maybe I'm just wrong here, and clearly in the long term your approach
will win out. If anyone's using these features in git in 10-20 years
we'll probably pat ourselves on the back in having gone through some
sort of transition here (although I think if it's not the default it
probably won't happen).

I just don't see why the problem at hand calls for a spec, a new format
to migrate to etc., as opposed to just adding comments or something like
that to the existing code saying "hey, people rely on this, consider
format changes carefully".

Then we kick the can down the road on any transition, which we can still
do if and when we find a really compelling reason for why we must change
the format.

So far the only "we must change it" that we have now AFAICT is the issue
of the OID in the comment, which for the reasons upthread I think is
pretty much a non-issue, and in any case for those that are willing to
migrate to a new format much more easily solved by a
--dont-add-the-oid-comment.

But if you're willing to work on all of this that's neat, although I do
reserve the right to bikeshed about some of these changes :)

It just seems rather orthogonal to the in-the-wild problem, is all.

Cheers.

Patch

diff --git a/Documentation/technical/tarball.txt b/Documentation/technical/tarball.txt
new file mode 100644
index 0000000000..fd23df2f33
--- /dev/null
+++ b/Documentation/technical/tarball.txt
@@ -0,0 +1,234 @@ 
+Git Canonical Tar Format
+========================
+
+Overview
+--------
+
+Many people find it convenient to have tar archives that are bit-for-bit
+identical between versions.  This can be valuable to validate that an archive
+has not changed using a cryptographic hash without needing to store the archive
+itself.
+
+However, up to now, Git has not guaranteed a consistent format, although people
+often make the assumption that Git's archives will always be bit-for-bit
+identical.  This has led to several notable problems with various forges.
+
+This document proposes a canonical tar format based on the POSIX pax format that
+is bit-for-bit reproducible.  It is referred to as ctar-v1 (canonical tar version 1).
+
+Goals and Rationale
+-------------------
+
+The goals for this format are that it is first and foremost reproducible, that
+identical trees produce identical results, that it is simple and easy to
+implement correctly, and that it is useful in general.  While we don't consider
+functionality needs beyond Git's at the moment (such as hardlinks, xattrs, or
+sparse files), there is intense interest in reproducible builds, and so it makes
+sense to design something that can see general use for software interchange.
+
+Because the goal is strict reproducibility, this format doesn't honor
+`tar.umask` or other options that can produce different output.  It serializes
+all timestamps as the Epoch, which produces identical results whether the tree
+is serialized as a tree, commit, or tag.  This is consistent with the behaviour
+of some other tar serializers, including the default for modern Rust crates, and
+is not believed to pose any interoperability problems.
+
+Object IDs are not included in this version of the format because this produces
+non-identical data when identical data is serialized with different hash
+algorithms.
+
+Introduction to the Underlying Format
+-------------------------------------
+
+A pax archive is an extended form of the ustar (Unix standard tar) archive, both
+defined in POSIX.1-2001.  Each file in a ustar archive is preceded by a single
+512-byte header block, followed by as many 512-byte data blocks as needed to
+store the data, padded with zeros.  At the end of the archive are two 512-byte
+blocks filled completely with zeros.
+
+A pax archive may additionally contain extended headers.  There is optionally
+one for the entire archive, which is called the global header, and one for each
+file.  If present, the global extended header is the first entry in the archive,
+and the per-file header precedes the file to which it corresponds.  Every
+extended header contains a normal ustar header block with either the `g`
+(global) or `x` (per-file) type, followed by metadata in a textual
+length-key-value form (`%d %s=%s\n`) which is stored as the data of this
+pseudo-file.
+
+A global extended header sets metadata for the entire archive, and a per-file
+extended header applies only to the file to which it corresponds.  A per-file
+extended header overrides any data specified in the global extended header, and
+all extended headers override any data stored in a normal ustar per-file header
+block.
+
+While pax extensions are widely supported by most modern versions of tar
+(including versions on Windows and all major open-source OSes), some older
+archivers and non-tar implementations which do not understand them typically
+extract the extended headers as regular files.  Thus, it's helpful to have these
+entries have reasonable permissions and unique names.
+
+General Architecture
+--------------------
+
+All canonical tar archives are valid POSIX pax archives as that format is
+defined in POSIX.1-2017.  Every archive will have a global header indicating the
+version and format and what types of data are valid in the archive.
+
+Every file serialized in the archive is serialized in lexicographical order by
+its bytes.  A directory is always serialized before its contents, and a
+directory is never serialized with a trailing slash.  If a system uses a Unicode
+encoding other than UTF-8, it encodes filenames as UTF-8.  Each file shall
+be preceded by a per-file pax extended header.
+
+It is possible to encode some extended header records in multiple ways because
+each record's length field counts its own digits.  For example, in cases where
+the length value can be encoded as either 99 or 100, both can lead to identical
+header data.  The shortest possible encoding must always be used.
+
+In any event where multiple encodings are possible, the shortest and, if there
+is still confusion, lexicographically first (by byte value) must always be used.
+All unspecified padding is filled with NUL bytes.
+
+Version Number
+--------------
+
+The version number for this version is `ctar-v1`.
+
+Extended Headers
+----------------
+
+Global Extended Header
+~~~~~~~~~~~~~~~~~~~~~~
+
+The global extended header (record `g`) shall contain one header:
+`CTAR.version`, which contains the version number specified above.
+
+The contents of the ustar header for the global extended header are as below,
+except that the `name` field contains `pax_global_header`.
+
+Per-File Extended Header
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Each file has a per-file extended header.
+
+The following per-file extended header fields are included:
+
+|===
+| Field Name   | When Present  | Value
+
+| `atime`      | always        | `0`
+| `mtime`      | always        | `0`
+| `size`       | always        | size of the data in bytes
+| `path`       | always        | full path name of the file
+| `uid`        | always        | `0`
+| `gid`        | always        | `0`
+| `uname`      | always        | `root`
+| `gname`      | always        | `root`
+| `linkpath`   | symbolic link | full path name of the link destination
+| `hdrcharset` | binary path   | `BINARY`
+|===
+Note that the `hdrcharset` entry appears if and only if the `path` or, if
+present, the `linkpath`, header contains a non-UTF-8 encoded string.  Because
+Git does not store the encoding of file names, it has no way of knowing whether
+a file name which could be valid UTF-8 actually is, but for the purposes of
+compatibility, such file names are assumed to be UTF-8 and are not declared as
+binary.  This improves portability to systems which always use Unicode.
+
+However, because we do not know for certain whether these values are UTF-8,
+we avoid explicitly declaring them as such and rely on the default archiver
+behavior, which may be more sensible.
+
+The `path` field contains the full path name without a leading slash or leading
+`.` or `..` component.  The path never contains a directory component which is
+`.` or `..`.
+
+The `linkpath` field contains the full symbolic link destination.  `.` and `..`
+components are permitted if the destination contains those values.
+
+In all cases, path names use `/` as the directory separator.
+
+The reason for always including most of the entries in the archive is to aid in
+implementing and testing correct serialization.  If these entries are always
+present, then this process becomes much simpler, whereas if they are only
+included as needed, then errors are more likely.
+
+The `name` field of the ustar header of this extended header is `paxheader.%d`,
+where `%d` represents the shortest-form decimal integer encoding the index of
+this file in the archive, starting with 0.  All files, directories, and links of
+whatever kind are counted, but extended headers are not.
+
+Serialization of Extended Headers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When serializing the header block for an extended header, the following values
+should be used.  Note that all text fields are NUL-padded on the right when
+they do not fill the field, and all octal fields are left-padded with zeros such
+that they fill the field with a single trailing NUL.  An empty field contains
+only NULs.
+
+|===
+| Field Name | Value
+
+| `name`     | `pax_global_header` (global) or `paxheader.%d` (per-file) (see above)
+| `mode`     | `0640`
+| `uid`      | `0`
+| `gid`      | `0`
+| `size`     | the size of the extended header in bytes
+| `mtime`    | `0` (the Epoch)
+| `chksum`   | as specified in the standard
+| `typeflag` | `g` (global) or `x` (per-file)
+| `linkname` | empty
+| `magic`    | as specified in the standard
+| `version`  | as specified in the standard
+| `uname`    | `root`
+| `gname`    | `root`
+| `devmajor` | `0`
+| `devminor` | `0`
+| `prefix`   | empty
+|===
+
+When encoding the data for an extended header, all entries are sorted in order
+by the byte values of their keys as encoded in UTF-8.  Duplicate keys are not
+permitted.
+
+Because the format allows multiple length encodings of some values, the shortest
+possible encoding must always be used.
+
+ustar headers
+-------------
+
+The ustar header for each file is serialized as below.  Note that all text
+fields are NUL-padded on the right when they do not fill the field, and all
+octal fields are left-padded with zeros such that they fill the field with a
+single trailing NUL.  An empty field contains only NULs.
+
+|===
+| Field Name | Value
+
+| `name`     | the last path component if it fits; otherwise, `path.%d`
+| `mode`     | `0640` (regular file), `0777` (symbolic link), `0750` (directory)
+| `uid`      | `0`
+| `gid`      | `0`
+| `size`     | the size of the data in bytes for regular files if it fits; otherwise, `0`
+| `mtime`    | `0` (the Epoch)
+| `chksum`   | as specified in the standard
+| `typeflag` | `0` (regular file), `2` (symbolic link), `5` (directory)
+| `linkname` | empty
+| `magic`    | as specified in the standard
+| `version`  | as specified in the standard
+| `uname`    | `root`
+| `gname`    | `root`
+| `devmajor` | `0`
+| `devminor` | `0`
+| `prefix`   | all non-trailing path components if they fit; otherwise, empty
+|===
+
+Note that the `size` field is always 0 for non-regular files.  The `typeflag`
+value for regular files is always `0`, not NUL.
+
+`prefix` does not contain a trailing slash.
+
+If the `name` field cannot contain the last path component, then it is
+serialized as `path.%d`, where `%d` represents the shortest-form decimal integer
+encoding the index of this file in the archive, starting with 0.  The `%d` value
+in this case is completely identical to the `%d` in the per-file pax header.
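
As an illustration of the serialization rules in the patch above, here is a
minimal Python sketch that builds the per-file extended header data (records
sorted by key, each with the shortest self-consistent length) and the 512-byte
ustar block that precedes it.  It is not part of the patch.  All function names
and the sample path `hello.txt` are hypothetical, and the checksum layout
(seven octal digits plus a NUL) is an assumption, since the document only says
"as specified in the standard".

```python
import tarfile


def pax_record(key: str, value: bytes) -> bytes:
    """Encode one '%d %s=%s\\n' record with the shortest valid length."""
    body = b" " + key.encode("utf-8") + b"=" + value + b"\n"
    digits = 1
    while True:
        # The length counts its own decimal digits, so try digit counts
        # from smallest to largest until one is self-consistent.
        total = len(body) + digits
        if len(str(total)) == digits:
            return str(total).encode("ascii") + body
        digits += 1


def pax_data(fields: dict[str, bytes]) -> bytes:
    """Concatenate records sorted by the UTF-8 bytes of their keys."""
    keys = sorted(fields, key=lambda k: k.encode("utf-8"))
    return b"".join(pax_record(k, fields[k]) for k in keys)


def octal(value: int, width: int) -> bytes:
    """Zero-padded octal digits filling the field, with one trailing NUL."""
    return f"{value:0{width - 1}o}".encode("ascii") + b"\x00"


def text(value: bytes, width: int) -> bytes:
    """Text field, NUL-padded on the right."""
    if len(value) > width:
        raise ValueError("value does not fit in field")
    return value.ljust(width, b"\x00")


def extended_header_block(name: bytes, typeflag: bytes, size: int) -> bytes:
    """Build the ustar block for a 'g' or 'x' extended header."""
    block = b"".join([
        text(name, 100),
        octal(0o640, 8),        # mode
        octal(0, 8),            # uid
        octal(0, 8),            # gid
        octal(size, 12),        # size of the extended header data
        octal(0, 12),           # mtime (the Epoch)
        b" " * 8,               # chksum placeholder: spaces while summing
        typeflag,
        text(b"", 100),         # linkname
        b"ustar\x00",           # magic
        b"00",                  # version
        text(b"root", 32),      # uname
        text(b"root", 32),      # gname
        octal(0, 8),            # devmajor
        octal(0, 8),            # devminor
        text(b"", 155),         # prefix
    ]).ljust(512, b"\x00")
    # Assumed layout: checksum written like the other octal fields.
    return block[:148] + octal(sum(block), 8) + block[156:]


data = pax_data({
    "atime": b"0", "mtime": b"0", "size": b"6", "path": b"hello.txt",
    "uid": b"0", "gid": b"0", "uname": b"root", "gname": b"root",
})
header = extended_header_block(b"paxheader.0", b"x", len(data))

# Sanity check: Python's tarfile validates the checksum while parsing.
parsed = tarfile.TarInfo.frombuf(header, "utf-8", "strict")
```

In an archive, `data` would itself be NUL-padded to a multiple of 512 bytes
after `header`, followed by the ustar header and data of the file it describes.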