mbox series

[0/9] git archive: use gzip again by default, document output stabilty

Message ID cover-0.9-00000000000-20230202T093212Z-avarab@gmail.com (mailing list archive)
Headers show
Series git archive: use gzip again by default, document output stabilty | expand

Message

Ævar Arnfjörð Bjarmason Feb. 2, 2023, 9:32 a.m. UTC
As reported in
https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/
changing the default "tgz" output method of from "gzip(1)" to our
internal "git archive gzip" (using zlib ) broke things for users in
the wild that assume that the "git archive" output is stable, most
notably GitHub: https://github.com/orgs/community/discussions/45830

Leaving aside the larger question of whether we're going to promise
output stability for "git archive" in general, the motivation for that
change was to have a working compression method on systems that lacked
a gzip(1).

As the disruption of changing the default isn't worth it, let's use
gzip(1) again by default, and only fall back on the new "git archive
gzip" if it isn't available.

The later parts of this series then document and test for the output
stability of the command.

We're not promising anything new there, except that we now promise
that we're going to use "gzip" as the default compressor, but that
it's up to that command to be stable, should the user desire output
stability.

The documentation discusses the various caveats involved, suggests
alternatives to checksumming compressed archives, but in the end notes
what's been the policy so far: We're not promising that the "tar"
output is going to be stable.

The early parts of this series (1-2/9) are clean-up for existing
config drift, as later in the series we'll otherwise need to change
the divergent config documentation in two places.

CI & branch for this at:
https://github.com/avar/git/tree/avar/archive-internal-gzip-not-the-default

Ævar Arnfjörð Bjarmason (9):
  archive & tar config docs: de-duplicate configuration section
  git config docs: document "tar.<format>.{command,remote}"
  archiver API: make the "flags" in "struct archiver" an enum
  archive: omit the shell for built-in "command" filters
  archive-tar.c: move internal gzip implementation to a function
  archive: use "gzip -cn" for stability, not "git archive gzip"
  test-lib.sh: add a lazy GZIP prerequisite
  archive tests: test for "gzip -cn" and "git archive gzip" stability
  git archive docs: document output non-stability

 Documentation/config/tar.txt           | 29 +++++++-
 Documentation/git-archive.txt          | 96 +++++++++++++++++++-------
 archive-tar.c                          | 78 ++++++++++++++-------
 archive.h                              | 11 +--
 t/t5000-tar-tree.sh                    |  2 -
 t/t5005-archive-stability.sh           | 70 +++++++++++++++++++
 t/t5562-http-backend-content-length.sh |  2 -
 t/test-lib.sh                          |  4 ++
 8 files changed, 231 insertions(+), 61 deletions(-)
 create mode 100755 t/t5005-archive-stability.sh

Comments

Phillip Wood Feb. 2, 2023, 4:17 p.m. UTC | #1
Hi Ævar

On 02/02/2023 09:32, Ævar Arnfjörð Bjarmason wrote:
> As reported in
> https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/
> changing the default "tgz" output method of from "gzip(1)" to our
> internal "git archive gzip" (using zlib ) broke things for users in
> the wild that assume that the "git archive" output is stable, most
> notably GitHub: https://github.com/orgs/community/discussions/45830
 >
> Leaving aside the larger question of whether we're going to promise
> output stability for "git archive" in general, the motivation for that
> change was to have a working compression method on systems that lacked
> a gzip(1).

As I recall the reduction in cpu time used to create a compressed 
archive was a factor in making it the default.

> As the disruption of changing the default isn't worth it, let's use
> gzip(1) again by default, and only fall back on the new "git archive
> gzip" if it isn't available.

Playing devil's advocate for a moment as we're not going to promise that 
the compressed output of "git archive" will be stable in the future 
perhaps we should use this breakage as an opportunity to highlight that 
to users and to advertize the config setting that allows them to use 
gzip for compressing archives. Reverting the change gives the misleading 
impression that we're making a commitment to keeping the output stable. 
The focus of this thread seems to be the problems relating to github 
which they have already addressed.

I think there is general agreement that it is not practical to promise 
that the compressed output of "git archive" is stable so maybe it is 
better to make that clear now while users can work around it in the 
short term with a config setting rather than waiting until we're faced 
with some security or other issue that forces a change to the output 
which users cannot work around so easily.

Best Wishes

Phillip


> The later parts of this series then document and test for the output
> stability of the command.
> 
> We're not promising anything new there, except that we now promise
> that we're going to use "gzip" as the default compressor, but that
> it's up to that command to be stable, should the user desire output
> stability.
> 
> The documentation discusses the various caveats involved, suggests
> alternatives to checksumming compressed archives, but in the end notes
> what's been the policy so far: We're not promising that the "tar"
> output is going to be stable.
> 
> The early parts of this series (1-2/9) are clean-up for existing
> config drift, as later in the series we'll otherwise need to change
> the divergent config documentation in two places.
> 
> CI & branch for this at:
> https://github.com/avar/git/tree/avar/archive-internal-gzip-not-the-default
> 
> Ævar Arnfjörð Bjarmason (9):
>    archive & tar config docs: de-duplicate configuration section
>    git config docs: document "tar.<format>.{command,remote}"
>    archiver API: make the "flags" in "struct archiver" an enum
>    archive: omit the shell for built-in "command" filters
>    archive-tar.c: move internal gzip implementation to a function
>    archive: use "gzip -cn" for stability, not "git archive gzip"
>    test-lib.sh: add a lazy GZIP prerequisite
>    archive tests: test for "gzip -cn" and "git archive gzip" stability
>    git archive docs: document output non-stability
> 
>   Documentation/config/tar.txt           | 29 +++++++-
>   Documentation/git-archive.txt          | 96 +++++++++++++++++++-------
>   archive-tar.c                          | 78 ++++++++++++++-------
>   archive.h                              | 11 +--
>   t/t5000-tar-tree.sh                    |  2 -
>   t/t5005-archive-stability.sh           | 70 +++++++++++++++++++
>   t/t5562-http-backend-content-length.sh |  2 -
>   t/test-lib.sh                          |  4 ++
>   8 files changed, 231 insertions(+), 61 deletions(-)
>   create mode 100755 t/t5005-archive-stability.sh
>
Junio C Hamano Feb. 2, 2023, 4:25 p.m. UTC | #2
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> As the disruption of changing the default isn't worth it, let's use
> gzip(1) again by default, and only fall back on the new "git archive
> gzip" if it isn't available.

It perhaps is OK, and lets us answer "ugh, the compressed output of
'git archive' is unstable again" with "we didn't change anything,
perhaps you changed your gzip(1)?" when they fix bugs or improve
compression or whatever.  Of course that is not an overall win for
the end users, but in the short term until gzip gets such a change,
we would presumably get the "same" output as before.
Junio C Hamano Feb. 2, 2023, 4:40 p.m. UTC | #3
Phillip Wood <phillip.wood123@gmail.com> writes:

> ... Reverting the change
> gives the misleading impression that we're making a commitment to
> keeping the output stable. The focus of this thread seems to be the
> problems relating to github which they have already addressed.
>
> I think there is general agreement that it is not practical to promise
> that the compressed output of "git archive" is stable so maybe it is
> better to make that clear now while users can work around it in the
> short term with a config setting rather than waiting until we're faced
> with some security or other issue that forces a change to the output
> which users cannot work around so easily.

I love to see somebody else play the devil's advocate role.  Thanks
for all of the above.
Raymond E. Pasco Feb. 2, 2023, 7:23 p.m. UTC | #4
February 2, 2023 11:17 AM, "Phillip Wood" <phillip.wood123@gmail.com> wrote:
> Playing devil's advocate for a moment as we're not going to promise that the compressed output of
> "git archive" will be stable in the future perhaps we should use this breakage as an opportunity to
> highlight that to users and to advertize the config setting that allows them to use gzip for
> compressing archives. Reverting the change gives the misleading impression that we're making a
> commitment to keeping the output stable. The focus of this thread seems to be the problems relating
> to github which they have already addressed.
> 
> I think there is general agreement that it is not practical to promise that the compressed output
> of "git archive" is stable so maybe it is better to make that clear now while users can work around
> it in the short term with a config setting rather than waiting until we're faced with some security
> or other issue that forces a change to the output which users cannot work around so easily.

Reverting to the behavior of "use some arbitrary gzip from $PATH" would
be a poor decision whether or not git were willing to make some
commitment to gzip stability, because Git does not control arbitrary
gzips on the user's $PATH. If Git did want to promise gzip stability, it 
could only start from something like the current internal implementation
along with a vendored zlib; if it doesn't, as appears to be the case, 
then the internal implementation is superior for the other reasons 
already discussed.

If the user wants to depend on a particular gzip executable they supply, 
this configuration knob already exists for them.

Since there is no guarantee of stability, but there has been a popular 
misconception that there is some such guarantee (e.g., [1]), some kind 
of STABILITY section describing how there isn't any and suggesting ways
the user can attain more stability via configuration seems to be a good
idea.

[1]: https://lists.reproducible-builds.org/pipermail/rb-general/2021-October/002422.html
Ævar Arnfjörð Bjarmason Feb. 3, 2023, 1:49 p.m. UTC | #5
On Thu, Feb 02 2023, Phillip Wood wrote:

> On 02/02/2023 09:32, Ævar Arnfjörð Bjarmason wrote:
>> As reported in
>> https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/
>> changing the default "tgz" output method of from "gzip(1)" to our
>> internal "git archive gzip" (using zlib ) broke things for users in
>> the wild that assume that the "git archive" output is stable, most
>> notably GitHub: https://github.com/orgs/community/discussions/45830
>>
>> Leaving aside the larger question of whether we're going to promise
>> output stability for "git archive" in general, the motivation for that
>> change was to have a working compression method on systems that lacked
>> a gzip(1).
>
> As I recall the reduction in cpu time used to create a compressed
> archive was a factor in making it the default.

I read those references in 76d7602631a (archive-tar: add internal gzip
implementation, 2022-06-15) more of a "it's not [much] slower", the flip
to the default in 4f4be00d302 (archive-tar: use internal gzip by
default, 2022-06-15) didn't discuss it.

So I didn't think it was important enough to mention (even though we're
now back to the faster "gzip" method).

>> As the disruption of changing the default isn't worth it, let's use
>> gzip(1) again by default, and only fall back on the new "git archive
>> gzip" if it isn't available.
>
> Playing devil's advocate for a moment as we're not going to promise
> that the compressed output of "git archive" will be stable in the
> future perhaps we should use this breakage as an opportunity to
> highlight that to users and to advertize the config setting that
> allows them to use gzip for compressing archives.

If we were trying to intentionally break things for those users we could
do a lot better than "git archive gzip", whose output is mostly the same
as "gzip", we could tweak one of the headers to make it different all
the time.

But I think it's better to advocate for such intentional chaos-monkeying
as a follow-up to this more conservative "oops, we broke stuff, it's
easy not to break it, so let's not do it'.

> Reverting the change gives the misleading impression that we're making
> a commitment to keeping the output stable.

I don't see how you can conclude that from this series. It explicitly
states that we make no such promises, what it does is go back to
allowing the gzip(1) command to make its own promises.

> The focus of this thread seems to be the
> problems relating to github which they have already addressed.

Which they've addressed by reverting the change, but while they're a
major user of git they're not the only one. They just happened to use
"git archive".

I think it would be a mistake to conclude that everyone who's run into
this has already done so, or is aware of it.

> I think there is general agreement that it is not practical to promise
> that the compressed output of "git archive" is stable so maybe it is
> better[...]

...better than what? This seems to imply that this series is making new
promises about the output stability, which it isn't doing.

> [...]to make that clear now while users can work around it in the
> short term with a config setting rather than waiting until we're faced
> with some security or other issue that forces a change to the output
> which users cannot work around so easily.

I think it's always been clear that you can use that setting. For ages
we've been saying:

	The `tar.gz` and `tgz` formats are defined automatically and use the
	command `gzip -cn` by default.

Then v2.38.0 changed it to:

	[...]
        magic command `git archive gzip` by default

Which IMO was easily missed among other "Performance, Internal
Implementation, Development Support etc." items in the release notes,
which said:

   Teach "git archive" to (optionally and then by default) avoid
   spawning an external "gzip" process when creating ".tar.gz" (and
   ".tgz") archives.

But I agree that all of this is subjective. To me a 2% reduction in CPU
use (at the cost of ~20% increse in wallclock) & some unclear benefits
to teaching users that they can't rely on our "gzip" output seems
unclear or hypothetical.

Whereas the widespread breakage reported is very real, and we should
consider GitHub as a canary for that, not the the stand & end of its
potential impact.

As we didn't have a strong reason to change this in the first place (and
as my series shows, we can have our cake & eat it too if we don't have a
"gzip") I think the obvious choice is to go back to using "gzip".
Theodore Ts'o Feb. 3, 2023, 3:47 p.m. UTC | #6
On Thu, Feb 02, 2023 at 04:17:09PM +0000, Phillip Wood wrote:
> Playing devil's advocate for a moment as we're not going to promise that the
> compressed output of "git archive" will be stable in the future perhaps we
> should use this breakage as an opportunity to highlight that to users and to
> advertize the config setting that allows them to use gzip for compressing
> archives. Reverting the change gives the misleading impression that we're
> making a commitment to keeping the output stable. The focus of this thread
> seems to be the problems relating to github which they have already
> addressed.
> 
> I think there is general agreement that it is not practical to promise that
> the compressed output of "git archive" is stable so maybe it is better to
> make that clear now while users can work around it in the short term with a
> config setting rather than waiting until we're faced with some security or
> other issue that forces a change to the output which users cannot work
> around so easily.

I would be in favor of adding a config option that allows using the
internal gzip option, although leave the default to be keep things
compatible.

The reason for that it should be easy for a forge provider such as
GitHub to break things, deliberately.  Sound insane?  Hear me out.

At $WORK, we have a highly reliable system, Paxos.  It is a highly
fault-tolerant system, so it rarely fails.  But "rarely fails" is not
the same as "never fails".  And hopefully, things should degrade
gracefully if there is a Paxos outage.  But as the Google SRE's are
fond of saying, "Hope is not a strategy".

So periodically, the people who run the Paxos service will
deliberately force downtime for a short amount of time.  The fact that
they will do this is well advertised, and scheduled ahead of time ---
and teams responsible for user-facing services are supposed to make
sure that end-users don't notice when this happens.  Maybe they won't
be able to update configurations as easily while Paxos is down, but it
shouldn't cause a user-visible outage.

So what I would recommend to the GitHub product manager, is that once
a quarter, on a well-advertised date, that they flip the switch and
break the git archive checksums for say, an hour.  Then next quarter,
they advertise that the switch will be thrown for 2 hours, doubling
each time, until it is ramped up to 16 hours.

This will provide the necessary nudge so that all of these badly
designed systems that depend on downloaded archives of arbitrary git
hubs to be stable will rethink their position, while minimizing the
end-user customer impact.  Otherwise, I predict that Bazel, homebrew,
etc will consider to rely on this ill-considered assumption, and at
some point in the future, when we *do* have a much better reason to
want to make a change to the tar or compression algorithm, all of
these end users will once again scream bloody murder.

Of course, this is going to be up to each forge provider to decide
whether they want to do this.  But we can make it easy for them to do
this thing, and I'd argue it is in our interest to make it easy for
them to do this.  Otherwise we'll get constrained in the future by the
fear of massive user blowback, no metter what we say in our
documentation regarding "no promises --- and next time, we really
mean it!"

	      	       	       	    	  - Ted
René Scharfe Feb. 4, 2023, 6:08 p.m. UTC | #7
Am 02.02.23 um 17:25 schrieb Junio C Hamano:
> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> As the disruption of changing the default isn't worth it, let's use
>> gzip(1) again by default, and only fall back on the new "git archive
>> gzip" if it isn't available.
>
> It perhaps is OK, and lets us answer "ugh, the compressed output of
> 'git archive' is unstable again" with "we didn't change anything,
> perhaps you changed your gzip(1)?" when they fix bugs or improve
> compression or whatever.  Of course that is not an overall win for
> the end users, but in the short term until gzip gets such a change,
> we would presumably get the "same" output as before.

Restoring the old default is an understandable reflex.  In theory it
worsens consistency and stability of the output, but in practice using
whatever was found in $PATH did work before -- or at least it was not
our problem if it didn't.

Are there still people left that would benefit from such a step back,
however?  As far as I understand forges like GitHub relied on git
archive producing the same tgz output across versions.  That assumption
was violated, trust lost.  They had to learn about the configuration
option tar.tgz.command or find some other way to cope.  Changing the
default again won't undo that.

René
Ævar Arnfjörð Bjarmason Feb. 5, 2023, 9:30 p.m. UTC | #8
On Sat, Feb 04 2023, René Scharfe wrote:

> Am 02.02.23 um 17:25 schrieb Junio C Hamano:
>> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>>
>>> As the disruption of changing the default isn't worth it, let's use
>>> gzip(1) again by default, and only fall back on the new "git archive
>>> gzip" if it isn't available.
>>
>> It perhaps is OK, and lets us answer "ugh, the compressed output of
>> 'git archive' is unstable again" with "we didn't change anything,
>> perhaps you changed your gzip(1)?" when they fix bugs or improve
>> compression or whatever.  Of course that is not an overall win for
>> the end users, but in the short term until gzip gets such a change,
>> we would presumably get the "same" output as before.
>
> Restoring the old default is an understandable reflex.  In theory it
> worsens consistency and stability of the output, but in practice using
> whatever was found in $PATH did work before -- or at least it was not
> our problem if it didn't.

"In theory" because the user might be flip-flopping between different
gzip(1) versions?

> Are there still people left that would benefit from such a step back,
> however?  As far as I understand forges like GitHub relied on git
> archive producing the same tgz output across versions.  That assumption
> was violated, trust lost.  They had to learn about the configuration
> option tar.tgz.command or find some other way to cope.  Changing the
> default again won't undo that.

I think it's safe to assume that git is used by enough users that
anything breaking at a major hosting provider is likely to have a very
long tail in the wild, almost all of which we'll never see in "this
broke for me" reports to this ML.

So no, that ship has clearly sailed for GitHub, but this series aims to
address more than that.

Even if it wasn't for that breakage, I think 4/9 and 6/9 here show the
main problem you were trying to solve in making "git archive gzip" the
default didn't need to be solved by changing the default. I.e. the aim
was to have it work when "gzip(1)" wasn't available, which we can do by
falling back only if we can't invoke it, rather than changing the
long-standing default.
Phillip Wood Feb. 6, 2023, 2:46 p.m. UTC | #9
On 03/02/2023 13:49, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Feb 02 2023, Phillip Wood wrote: >> Reverting the change gives the misleading impression that we're making
>> a commitment to keeping the output stable.
> 
> I don't see how you can conclude that from this series. It explicitly
> states that we make no such promises, what it does is go back to
> allowing the gzip(1) command to make its own promises.

This series would not be happening if we were not reverting a change to 
the compressed output of 'git archive'. The documentation updates are 
very welcome but I think we're undermining the message that the 
compressed output can change by reverting that change.

>> The focus of this thread seems to be the
>> problems relating to github which they have already addressed.
> 
> Which they've addressed by reverting the change, but while they're a
> major user of git they're not the only one. They just happened to use
> "git archive".
> 
> I think it would be a mistake to conclude that everyone who's run into
> this has already done so, or is aware of it.

I've spent some time trying to find reports of problems caused by this 
change and have not seen anything apart from the issue with GitHub. 
Although it takes a while for new versions of git to get into linux 
distributions if there is a widespread problem we normally hear about it 
pretty quickly. This change has been in two releases now. If anyone does 
have a problem there is an easy fix in the form of setting 
tar.<format>.command

>> I think there is general agreement that it is not practical to promise
>> that the compressed output of "git archive" is stable so maybe it is
>> better[...]
> 
> ...better than what? This seems to imply that this series is making new
> promises about the output stability, which it isn't doing.

It's better people realize they cannot rely on the output being stable 
now when they can safely work around the problem while working on a 
proper fix rather than waiting until the change in output is caused by a 
security issue in gzip which means the work around is no longer safe.

Best Wishes

Phillip

>> [...]to make that clear now while users can work around it in the
>> short term with a config setting rather than waiting until we're faced
>> with some security or other issue that forces a change to the output
>> which users cannot work around so easily.
> 
> I think it's always been clear that you can use that setting. For ages
> we've been saying:
> 
> 	The `tar.gz` and `tgz` formats are defined automatically and use the
> 	command `gzip -cn` by default.
> 
> Then v2.38.0 changed it to:
> 
> 	[...]
>          magic command `git archive gzip` by default
> 
> Which IMO was easily missed among other "Performance, Internal
> Implementation, Development Support etc." items in the release notes,
> which said:
> 
>     Teach "git archive" to (optionally and then by default) avoid
>     spawning an external "gzip" process when creating ".tar.gz" (and
>     ".tgz") archives.
> 
> But I agree that all of this is subjective. To me a 2% reduction in CPU
> use (at the cost of ~20% increse in wallclock) & some unclear benefits
> to teaching users that they can't rely on our "gzip" output seems
> unclear or hypothetical.
> 
> Whereas the widespread breakage reported is very real,

where are the reports of widespread berakage outside of GitHub?

> and we should
> consider GitHub as a canary for that, not the the stand & end of its
> potential impact.
> 
> As we didn't have a strong reason to change this in the first place (and
> as my series shows, we can have our cake & eat it too if we don't have a
> "gzip") I think the obvious choice is to go back to using "gzip".
René Scharfe Feb. 12, 2023, 5:41 p.m. UTC | #10
Am 05.02.23 um 22:30 schrieb Ævar Arnfjörð Bjarmason:
>
> On Sat, Feb 04 2023, René Scharfe wrote:
>
>> Am 02.02.23 um 17:25 schrieb Junio C Hamano:
>>> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>>>
>>>> As the disruption of changing the default isn't worth it, let's use
>>>> gzip(1) again by default, and only fall back on the new "git archive
>>>> gzip" if it isn't available.
>>>
>>> It perhaps is OK, and lets us answer "ugh, the compressed output of
>>> 'git archive' is unstable again" with "we didn't change anything,
>>> perhaps you changed your gzip(1)?" when they fix bugs or improve
>>> compression or whatever.  Of course that is not an overall win for
>>> the end users, but in the short term until gzip gets such a change,
>>> we would presumably get the "same" output as before.
>>
>> Restoring the old default is an understandable reflex.  In theory it
>> worsens consistency and stability of the output, but in practice using
>> whatever was found in $PATH did work before -- or at least it was not
>> our problem if it didn't.
>
> "In theory" because the user might be flip-flopping between different
> gzip(1) versions?

No flopping needed.  We can't control what's in $PATH.  There are
OS-specific replacements for GNU gzip in NetBSD/FreeBSD/macOS and
OpenBSD.  People could use pigz.  Or cat, for that matter.  Different
versions of different tools might produce different output.

There are alternative to the original libz as well, e.g. libz-ng.  We
don't control which one or which version is installed, either, but we
could do so if we wanted by importing one of them like we did with
LibXDiff.

> Even if it wasn't for that breakage, I think 4/9 and 6/9 here show the
> main problem you were trying to solve in making "git archive gzip" the
> default didn't need to be solved by changing the default. I.e. the aim
> was to have it work when "gzip(1)" wasn't available, which we can do by
> falling back only if we can't invoke it, rather than changing the
> long-standing default.

The aim was to no longer depend on gzip.  That goal was already met by
providing the internal implementation, without changing the default.
Git for Windows for example could use it in their config and drop gzip.

Calling gzip if available, warning if it isn't and using the internal
implementation adds yet more variance.  No longer allowing gzip to be a
shell alias might confuse someone.  The automatic fallback would only
benefit users that don't want to touch /etc/gitconfig, have nobody to
do it for them and don't care about warnings -- hopefully not a big
crowd.

I didn't intend the change of default to be that painful, but don't see
the point in going back now that we're through.  The new default is
better -- one less dependency to care about.  And if we need to go
back, however, then a know-good state makes more sense than a smart
fallback with some new twists.

René