diff mbox series

[4/4] docs: note that archives are not stable

Message ID 20210227191813.96148-5-sandals@crustytoothpaste.net (mailing list archive)
State New
Headers show
Series Documentation updates to FAQ and git-archive | expand

Commit Message

brian m. carlson Feb. 27, 2021, 7:18 p.m. UTC
We have in the past told users on the list that git archive does not
necessarily produce stable archives, but we've never explicitly
documented this.  Unfortunately, we've had people in the past who have
relied on the relative stability of our archives to their detriment and
then had breakage occur.

Let's tell people that we don't guarantee stable archives so that they
can make good choices about how they structure their tooling and don't
end up with problems if we need to change archives later.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/git-archive.txt | 3 +++
 1 file changed, 3 insertions(+)

Comments

Ævar Arnfjörð Bjarmason Feb. 28, 2021, 12:48 p.m. UTC | #1
On Sat, Feb 27 2021, brian m. carlson wrote:

> We have in the past told users on the list that git archive does not
> necessarily produce stable archives, but we've never explicitly
> documented this.  Unfortunately, we've had people in the past who have
> relied on the relative stability of our archives to their detriment and
> then had breakage occur.
>
> Let's tell people that we don't guarantee stable archives so that they
> can make good choices about how they structure their tooling and don't
> end up with problems if we need to change archives later.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  Documentation/git-archive.txt | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
> index 9f8172828d..1f126cbdcc 100644
> --- a/Documentation/git-archive.txt
> +++ b/Documentation/git-archive.txt
> @@ -30,6 +30,9 @@ extended pax header if the tar format is used; it can be extracted
>  using 'git get-tar-commit-id'. In ZIP files it is stored as a file
>  comment.
>  
> +The output of 'git archive' is not guaranteed to be stable and may change
> +between versions.

Is "stable archive" a well-known term people would understand, or is
someone going to read this thinking they might extract different content
today than tomorrow ? :) I wonder how much if anything this means to
someone not privy to the recent thread[1] that prompted this patch.

Perhaps something like this instead:

    The output of 'git archive' is guaranteed to be the same across
    versions of git, but the archive itself is not guaranteed to be
    bit-for-bit identical.

    In practice the output of 'git archive' is relatively stable across
    git versions, but has changed in the past, and most likely will in
    the future.

    Since the tar format provides multiple ways to encode the same
    output (ordering, headers, padding etc.) you should not rely on
    output being bit-for-bit identical across versions of git for
    e.g. GPG signing a SHA-256 hash of an archive generated with one
    version of git, and then expecting to be able to validate that GPG
    signature with a freshly generated archive made with same arguments
    on another version of git.

1. https://lore.kernel.org/git/20210122213954.7dlnnpngjoay3oia@chatter.i7.local/
brian m. carlson Feb. 28, 2021, 6:19 p.m. UTC | #2
On 2021-02-28 at 12:48:56, Ævar Arnfjörð Bjarmason wrote:
> Perhaps something like this instead:
> 
>     The output of 'git archive' is guaranteed to be the same across
>     versions of git, but the archive itself is not guaranteed to be
>     bit-for-bit identical.
> 
>     In practice the output of 'git archive' is relatively stable across
>     git versions, but has changed in the past, and most likely will in
>     the future.
> 
>     Since the tar format provides multiple ways to encode the same
>     output (ordering, headers, padding etc.) you should not rely on
>     output being bit-for-bit identical across versions of git for
>     e.g. GPG signing a SHA-256 hash of an archive generated with one
>     version of git, and then expecting to be able to validate that GPG
>     signature with a freshly generated archive made with same arguments
>     on another version of git.

I think something like this is good.  I'm a bit nervous about telling
people that the output is relatively stable because that will likely
push people in the direction that we don't want to encourage.  I might
rephrase the first two paragraphs as so:

  The output of 'git archive' is guaranteed to be the same across
  versions of git, but the archive itself is not guaranteed to be
  bit-for-bit identical.  The output of 'git archive' has changed
  in the past, and most likely will in the future.

I'm not very familiar with the zip format, but I assume that it also has
features that allow equivalent but not bit-for-bit equal archives.
Looking at Wikipedia leads me to believe that one could indeed create
different archives just by either writing a Zip64 record or not, and if
we store the SHA-1 revision ID in a comment, then we would also produce
a different archive when using an equivalent SHA-256 repo.  And of
course there's compression, which allows many different but equivalent
serializations.  So we'd probably need to say the same thing about zip
files as well.
Ævar Arnfjörð Bjarmason Feb. 28, 2021, 6:46 p.m. UTC | #3
On Sun, Feb 28 2021, brian m. carlson wrote:

> On 2021-02-28 at 12:48:56, Ævar Arnfjörð Bjarmason wrote:
>> Perhaps something like this instead:
>> 
>>     The output of 'git archive' is guaranteed to be the same across
>>     versions of git, but the archive itself is not guaranteed to be
>>     bit-for-bit identical.
>> 
>>     In practice the output of 'git archive' is relatively stable across
>>     git versions, but has changed in the past, and most likely will in
>>     the future.
>> 
>>     Since the tar format provides multiple ways to encode the same
>>     output (ordering, headers, padding etc.) you should not rely on
>>     output being bit-for-bit identical across versions of git for
>>     e.g. GPG signing a SHA-256 hash of an archive generated with one
>>     version of git, and then expecting to be able to validate that GPG
>>     signature with a freshly generated archive made with same arguments
>>     on another version of git.
>
> I think something like this is good.  I'm a bit nervous about telling
> people that the output is relatively stable because that will likely
> push people in the direction that we don't want to encourage.  I might
> rephrase the first two paragraphs as so:
>
>   The output of 'git archive' is guaranteed to be the same across
>   versions of git, but the archive itself is not guaranteed to be
>   bit-for-bit identical.  The output of 'git archive' has changed
>   in the past, and most likely will in the future.
>
> I'm not very familiar with the zip format, but I assume that it also has
> features that allow equivalent but not bit-for-bit equal archives.
> Looking at Wikipedia leads me to believe that one could indeed create
> different archives just by either writing a Zip64 record or not, and if
> we store the SHA-1 revision ID in a comment, then we would also produce
> a different archive when using an equivalent SHA-256 repo.  And of
> course there's compression, which allows many different but equivalent
> serializations.  So we'd probably need to say the same thing about zip
> files as well.

Yes, I think your version is better, and we should have some wording so
it generalizes to the various output formats we support, perhaps further
noting that the "relatively stable" (if you want to keep a note of that)
only refers to our own output, not when we invoke gzip or zip.

I thought that "relatively stable" and "[when you extract it you get the
same thing]" were good to note, to say that e.g. GPG signing across
versions = bad, but if you e.g. offer downloadable archives with the
contents of tags, there's no reason to make your git version a part of a
cache key for the purposes of saving yourself CPU time when
(re-)generating them.
Junio C Hamano March 1, 2021, 6:15 p.m. UTC | #4
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>   The output of 'git archive' is guaranteed to be the same across
>   versions of git, but the archive itself is not guaranteed to be
>   bit-for-bit identical.

I do not quite get this; your original was clearer.  What does it
mean to "be the same across versions of git but not identical" at
the same time?  If output from Git version 1.0 and 2.0 are guranteed
to be the same across versions, what more is there for the readers
to worry about the format stability?

Perhaps you meant

	... is guaranteed to be the same for any given version of
	Git across ports.

or something?  It would allow kernel.org's use of "Konstantin tells
kernel.org users to use Git version X to run 'git archive' and
create detached signature on the output, and upload only the
signature.  The site uses the same Git version X to run 'git
archive' to create a tarball and the detached signature magically
matches, as the output on two places are bit-for-bit identical".

>   The output of 'git archive' has changed
>   in the past, and most likely will in the future.

That is correct as a statement of fact.  I feel that saying it is
either redundant and insufficient at the same time.  If we want to
tell them "do not depend on the output being bit-for-bit identical",
we should say it more explicitly after this sentence, I would think.
brian m. carlson March 3, 2021, 12:36 a.m. UTC | #5
On 2021-03-01 at 18:15:29, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> >   The output of 'git archive' is guaranteed to be the same across
> >   versions of git, but the archive itself is not guaranteed to be
> >   bit-for-bit identical.
> 
> I do not quite get this; your original was clearer.  What does it
> mean to "be the same across versions of git but not identical" at
> the same time?  If output from Git version 1.0 and 2.0 are guranteed
> to be the same across versions, what more is there for the readers
> to worry about the format stability?
> 
> Perhaps you meant
> 
> 	... is guaranteed to be the same for any given version of
> 	Git across ports.
> 
> or something?  It would allow kernel.org's use of "Konstantin tells
> kernel.org users to use Git version X to run 'git archive' and
> create detached signature on the output, and upload only the
> signature.  The site uses the same Git version X to run 'git
> archive' to create a tarball and the detached signature magically
> matches, as the output on two places are bit-for-bit identical".

I think what I had intended was that Git produces deterministic output,
but I don't actually think that's true across ports.  If someone uses a
different version of zlib on a different OS, the output may differ.

I'll rephrase to avoid giving a misleading impression.

> >   The output of 'git archive' has changed
> >   in the past, and most likely will in the future.
> 
> That is correct as a statement of fact.  I feel that saying it is
> either redundant and insufficient at the same time.  If we want to
> tell them "do not depend on the output being bit-for-bit identical",
> we should say it more explicitly after this sentence, I would think.

I agree we should explicitly say that.
Junio C Hamano March 3, 2021, 6:55 a.m. UTC | #6
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I think what I had intended was that Git produces deterministic output,
> but I don't actually think that's true across ports.  If someone uses a
> different version of zlib on a different OS, the output may differ.

I agree.  When I wrote my response, I had the "tar" format in mind,
which we write everything ourselves, but zip and also the compressed
output is a different story---we do rely on third-party libraries.

Thanks.
diff mbox series

Patch

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 9f8172828d..1f126cbdcc 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -30,6 +30,9 @@  extended pax header if the tar format is used; it can be extracted
 using 'git get-tar-commit-id'. In ZIP files it is stored as a file
 comment.
 
+The output of 'git archive' is not guaranteed to be stable and may change
+between versions.
+
 OPTIONS
 -------