Message ID | cover-0.9-00000000000-20230202T093212Z-avarab@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | git archive: use gzip again by default, document output stabilty | expand |
Hi Ævar On 02/02/2023 09:32, Ævar Arnfjörð Bjarmason wrote: > As reported in > https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/ > changing the default "tgz" output method of from "gzip(1)" to our > internal "git archive gzip" (using zlib ) broke things for users in > the wild that assume that the "git archive" output is stable, most > notably GitHub: https://github.com/orgs/community/discussions/45830 > > Leaving aside the larger question of whether we're going to promise > output stability for "git archive" in general, the motivation for that > change was to have a working compression method on systems that lacked > a gzip(1). As I recall the reduction in cpu time used to create a compressed archive was a factor in making it the default. > As the disruption of changing the default isn't worth it, let's use > gzip(1) again by default, and only fall back on the new "git archive > gzip" if it isn't available. Playing devil's advocate for a moment as we're not going to promise that the compressed output of "git archive" will be stable in the future perhaps we should use this breakage as an opportunity to highlight that to users and to advertize the config setting that allows them to use gzip for compressing archives. Reverting the change gives the misleading impression that we're making a commitment to keeping the output stable. The focus of this thread seems to be the problems relating to github which they have already addressed. I think there is general agreement that it is not practical to promise that the compressed output of "git archive" is stable so maybe it is better to make that clear now while users can work around it in the short term with a config setting rather than waiting until we're faced with some security or other issue that forces a change to the output which users cannot work around so easily. Best Wishes Phillip > The later parts of this series then document and test for the output > stability of the command. > > We're not promising anything new there, except that we now promise > that we're going to use "gzip" as the default compressor, but that > it's up to that command to be stable, should the user desire output > stability. > > The documentation discusses the various caveats involved, suggests > alternatives to checksumming compressed archives, but in the end notes > what's been the policy so far: We're not promising that the "tar" > output is going to be stable. > > The early parts of this series (1-2/9) are clean-up for existing > config drift, as later in the series we'll otherwise need to change > the divergent config documentation in two places. > > CI & branch for this at: > https://github.com/avar/git/tree/avar/archive-internal-gzip-not-the-default > > Ævar Arnfjörð Bjarmason (9): > archive & tar config docs: de-duplicate configuration section > git config docs: document "tar.<format>.{command,remote}" > archiver API: make the "flags" in "struct archiver" an enum > archive: omit the shell for built-in "command" filters > archive-tar.c: move internal gzip implementation to a function > archive: use "gzip -cn" for stability, not "git archive gzip" > test-lib.sh: add a lazy GZIP prerequisite > archive tests: test for "gzip -cn" and "git archive gzip" stability > git archive docs: document output non-stability > > Documentation/config/tar.txt | 29 +++++++- > Documentation/git-archive.txt | 96 +++++++++++++++++++------- > archive-tar.c | 78 ++++++++++++++------- > archive.h | 11 +-- > t/t5000-tar-tree.sh | 2 - > t/t5005-archive-stability.sh | 70 +++++++++++++++++++ > t/t5562-http-backend-content-length.sh | 2 - > t/test-lib.sh | 4 ++ > 8 files changed, 231 insertions(+), 61 deletions(-) > create mode 100755 t/t5005-archive-stability.sh >
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > As the disruption of changing the default isn't worth it, let's use > gzip(1) again by default, and only fall back on the new "git archive > gzip" if it isn't available. It perhaps is OK, and lets us answer "ugh, the compressed output of 'git archive' is unstable again" with "we didn't change anything, perhaps you changed your gzip(1)?" when they fix bugs or improve compression or whatever. Of course that is not an overall win for the end users, but in the short term until gzip gets such a change, we would presumably get the "same" output as before.
Phillip Wood <phillip.wood123@gmail.com> writes: > ... Reverting the change > gives the misleading impression that we're making a commitment to > keeping the output stable. The focus of this thread seems to be the > problems relating to github which they have already addressed. > > I think there is general agreement that it is not practical to promise > that the compressed output of "git archive" is stable so maybe it is > better to make that clear now while users can work around it in the > short term with a config setting rather than waiting until we're faced > with some security or other issue that forces a change to the output > which users cannot work around so easily. I love to see somebody else play the devil's advocate role. Thanks for all of the above.
February 2, 2023 11:17 AM, "Phillip Wood" <phillip.wood123@gmail.com> wrote: > Playing devil's advocate for a moment as we're not going to promise that the compressed output of > "git archive" will be stable in the future perhaps we should use this breakage as an opportunity to > highlight that to users and to advertize the config setting that allows them to use gzip for > compressing archives. Reverting the change gives the misleading impression that we're making a > commitment to keeping the output stable. The focus of this thread seems to be the problems relating > to github which they have already addressed. > > I think there is general agreement that it is not practical to promise that the compressed output > of "git archive" is stable so maybe it is better to make that clear now while users can work around > it in the short term with a config setting rather than waiting until we're faced with some security > or other issue that forces a change to the output which users cannot work around so easily. Reverting to the behavior of "use some arbitrary gzip from $PATH" would be a poor decision whether or not git were willing to make some commitment to gzip stability, because Git does not control arbitrary gzips on the user's $PATH. If Git did want to promise gzip stability, it could only start from something like the current internal implementation along with a vendored zlib; if it doesn't, as appears to be the case, then the internal implementation is superior for the other reasons already discussed. If the user wants to depend on a particular gzip executable they supply, this configuration knob already exists for them. Since there is no guarantee of stability, but there has been a popular misconception that there is some such guarantee (e.g., [1]), some kind of STABILITY section describing how there isn't any and suggesting ways the user can attain more stability via configuration seems to be a good idea. [1]: https://lists.reproducible-builds.org/pipermail/rb-general/2021-October/002422.html
On Thu, Feb 02 2023, Phillip Wood wrote: > On 02/02/2023 09:32, Ævar Arnfjörð Bjarmason wrote: >> As reported in >> https://lore.kernel.org/git/a812a664-67ea-c0ba-599f-cb79e2d96694@gmail.com/ >> changing the default "tgz" output method of from "gzip(1)" to our >> internal "git archive gzip" (using zlib ) broke things for users in >> the wild that assume that the "git archive" output is stable, most >> notably GitHub: https://github.com/orgs/community/discussions/45830 >> >> Leaving aside the larger question of whether we're going to promise >> output stability for "git archive" in general, the motivation for that >> change was to have a working compression method on systems that lacked >> a gzip(1). > > As I recall the reduction in cpu time used to create a compressed > archive was a factor in making it the default. I read those references in 76d7602631a (archive-tar: add internal gzip implementation, 2022-06-15) more of a "it's not [much] slower", the flip to the default in 4f4be00d302 (archive-tar: use internal gzip by default, 2022-06-15) didn't discuss it. So I didn't think it was important enough to mention (even though we're now back to the faster "gzip" method). >> As the disruption of changing the default isn't worth it, let's use >> gzip(1) again by default, and only fall back on the new "git archive >> gzip" if it isn't available. > > Playing devil's advocate for a moment as we're not going to promise > that the compressed output of "git archive" will be stable in the > future perhaps we should use this breakage as an opportunity to > highlight that to users and to advertize the config setting that > allows them to use gzip for compressing archives. If we were trying to intentionally break things for those users we could do a lot better than "git archive gzip", whose output is mostly the same as "gzip", we could tweak one of the headers to make it different all the time. But I think it's better to advocate for such intentional chaos-monkeying as a follow-up to this more conservative "oops, we broke stuff, it's easy not to break it, so let's not do it'. > Reverting the change gives the misleading impression that we're making > a commitment to keeping the output stable. I don't see how you can conclude that from this series. It explicitly states that we make no such promises, what it does is go back to allowing the gzip(1) command to make its own promises. > The focus of this thread seems to be the > problems relating to github which they have already addressed. Which they've addressed by reverting the change, but while they're a major user of git they're not the only one. They just happened to use "git archive". I think it would be a mistake to conclude that everyone who's run into this has already done so, or is aware of it. > I think there is general agreement that it is not practical to promise > that the compressed output of "git archive" is stable so maybe it is > better[...] ...better than what? This seems to imply that this series is making new promises about the output stability, which it isn't doing. > [...]to make that clear now while users can work around it in the > short term with a config setting rather than waiting until we're faced > with some security or other issue that forces a change to the output > which users cannot work around so easily. I think it's always been clear that you can use that setting. For ages we've been saying: The `tar.gz` and `tgz` formats are defined automatically and use the command `gzip -cn` by default. Then v2.38.0 changed it to: [...] magic command `git archive gzip` by default Which IMO was easily missed among other "Performance, Internal Implementation, Development Support etc." items in the release notes, which said: Teach "git archive" to (optionally and then by default) avoid spawning an external "gzip" process when creating ".tar.gz" (and ".tgz") archives. But I agree that all of this is subjective. To me a 2% reduction in CPU use (at the cost of ~20% increse in wallclock) & some unclear benefits to teaching users that they can't rely on our "gzip" output seems unclear or hypothetical. Whereas the widespread breakage reported is very real, and we should consider GitHub as a canary for that, not the the stand & end of its potential impact. As we didn't have a strong reason to change this in the first place (and as my series shows, we can have our cake & eat it too if we don't have a "gzip") I think the obvious choice is to go back to using "gzip".
On Thu, Feb 02, 2023 at 04:17:09PM +0000, Phillip Wood wrote: > Playing devil's advocate for a moment as we're not going to promise that the > compressed output of "git archive" will be stable in the future perhaps we > should use this breakage as an opportunity to highlight that to users and to > advertize the config setting that allows them to use gzip for compressing > archives. Reverting the change gives the misleading impression that we're > making a commitment to keeping the output stable. The focus of this thread > seems to be the problems relating to github which they have already > addressed. > > I think there is general agreement that it is not practical to promise that > the compressed output of "git archive" is stable so maybe it is better to > make that clear now while users can work around it in the short term with a > config setting rather than waiting until we're faced with some security or > other issue that forces a change to the output which users cannot work > around so easily. I would be in favor of adding a config option that allows using the internal gzip option, although leave the default to be keep things compatible. The reason for that it should be easy for a forge provider such as GitHub to break things, deliberately. Sound insane? Hear me out. At $WORK, we have a highly reliable system, Paxos. It is a highly fault-tolerant system, so it rarely fails. But "rarely fails" is not the same as "never fails". And hopefully, things should degrade gracefully if there is a Paxos outage. But as the Google SRE's are fond of saying, "Hope is not a strategy". So periodically, the people who run the Paxos service will deliberately force downtime for a short amount of time. The fact that they will do this is well advertised, and scheduled ahead of time --- and teams responsible for user-facing services are supposed to make sure that end-users don't notice when this happens. Maybe they won't be able to update configurations as easily while Paxos is down, but it shouldn't cause a user-visible outage. So what I would recommend to the GitHub product manager, is that once a quarter, on a well-advertised date, that they flip the switch and break the git archive checksums for say, an hour. Then next quarter, they advertise that the switch will be thrown for 2 hours, doubling each time, until it is ramped up to 16 hours. This will provide the necessary nudge so that all of these badly designed systems that depend on downloaded archives of arbitrary git hubs to be stable will rethink their position, while minimizing the end-user customer impact. Otherwise, I predict that Bazel, homebrew, etc will consider to rely on this ill-considered assumption, and at some point in the future, when we *do* have a much better reason to want to make a change to the tar or compression algorithm, all of these end users will once again scream bloody murder. Of course, this is going to be up to each forge provider to decide whether they want to do this. But we can make it easy for them to do this thing, and I'd argue it is in our interest to make it easy for them to do this. Otherwise we'll get constrained in the future by the fear of massive user blowback, no metter what we say in our documentation regarding "no promises --- and next time, we really mean it!" - Ted
Am 02.02.23 um 17:25 schrieb Junio C Hamano: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >> As the disruption of changing the default isn't worth it, let's use >> gzip(1) again by default, and only fall back on the new "git archive >> gzip" if it isn't available. > > It perhaps is OK, and lets us answer "ugh, the compressed output of > 'git archive' is unstable again" with "we didn't change anything, > perhaps you changed your gzip(1)?" when they fix bugs or improve > compression or whatever. Of course that is not an overall win for > the end users, but in the short term until gzip gets such a change, > we would presumably get the "same" output as before. Restoring the old default is an understandable reflex. In theory it worsens consistency and stability of the output, but in practice using whatever was found in $PATH did work before -- or at least it was not our problem if it didn't. Are there still people left that would benefit from such a step back, however? As far as I understand forges like GitHub relied on git archive producing the same tgz output across versions. That assumption was violated, trust lost. They had to learn about the configuration option tar.tgz.command or find some other way to cope. Changing the default again won't undo that. René
On Sat, Feb 04 2023, René Scharfe wrote: > Am 02.02.23 um 17:25 schrieb Junio C Hamano: >> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> >>> As the disruption of changing the default isn't worth it, let's use >>> gzip(1) again by default, and only fall back on the new "git archive >>> gzip" if it isn't available. >> >> It perhaps is OK, and lets us answer "ugh, the compressed output of >> 'git archive' is unstable again" with "we didn't change anything, >> perhaps you changed your gzip(1)?" when they fix bugs or improve >> compression or whatever. Of course that is not an overall win for >> the end users, but in the short term until gzip gets such a change, >> we would presumably get the "same" output as before. > > Restoring the old default is an understandable reflex. In theory it > worsens consistency and stability of the output, but in practice using > whatever was found in $PATH did work before -- or at least it was not > our problem if it didn't. "In theory" because the user might be flip-flopping between different gzip(1) versions? > Are there still people left that would benefit from such a step back, > however? As far as I understand forges like GitHub relied on git > archive producing the same tgz output across versions. That assumption > was violated, trust lost. They had to learn about the configuration > option tar.tgz.command or find some other way to cope. Changing the > default again won't undo that. I think it's safe to assume that git is used by enough users that anything breaking at a major hosting provider is likely to have a very long tail in the wild, almost all of which we'll never see in "this broke for me" reports to this ML. So no, that ship has clearly sailed for GitHub, but this series aims to address more than that. Even if it wasn't for that breakage, I think 4/9 and 6/9 here show the main problem you were trying to solve in making "git archive gzip" the default didn't need to be solved by changing the default. I.e. the aim was to have it work when "gzip(1)" wasn't available, which we can do by falling back only if we can't invoke it, rather than changing the long-standing default.
On 03/02/2023 13:49, Ævar Arnfjörð Bjarmason wrote: > > On Thu, Feb 02 2023, Phillip Wood wrote: >> Reverting the change gives the misleading impression that we're making >> a commitment to keeping the output stable. > > I don't see how you can conclude that from this series. It explicitly > states that we make no such promises, what it does is go back to > allowing the gzip(1) command to make its own promises. This series would not be happening if we were not reverting a change to the compressed output of 'git archive'. The documentation updates are very welcome but I think we're undermining the message that the compressed output can change by reverting that change. >> The focus of this thread seems to be the >> problems relating to github which they have already addressed. > > Which they've addressed by reverting the change, but while they're a > major user of git they're not the only one. They just happened to use > "git archive". > > I think it would be a mistake to conclude that everyone who's run into > this has already done so, or is aware of it. I've spent some time trying to find reports of problems caused by this change and have not seen anything apart from the issue with GitHub. Although it takes a while for new versions of git to get into linux distributions if there is a widespread problem we normally hear about it pretty quickly. This change has been in two releases now. If anyone does have a problem there is an easy fix in the form of setting tar.<format>.command >> I think there is general agreement that it is not practical to promise >> that the compressed output of "git archive" is stable so maybe it is >> better[...] > > ...better than what? This seems to imply that this series is making new > promises about the output stability, which it isn't doing. It's better people realize they cannot rely on the output being stable now when they can safely work around the problem while working on a proper fix rather than waiting until the change in output is caused by a security issue in gzip which means the work around is no longer safe. Best Wishes Phillip >> [...]to make that clear now while users can work around it in the >> short term with a config setting rather than waiting until we're faced >> with some security or other issue that forces a change to the output >> which users cannot work around so easily. > > I think it's always been clear that you can use that setting. For ages > we've been saying: > > The `tar.gz` and `tgz` formats are defined automatically and use the > command `gzip -cn` by default. > > Then v2.38.0 changed it to: > > [...] > magic command `git archive gzip` by default > > Which IMO was easily missed among other "Performance, Internal > Implementation, Development Support etc." items in the release notes, > which said: > > Teach "git archive" to (optionally and then by default) avoid > spawning an external "gzip" process when creating ".tar.gz" (and > ".tgz") archives. > > But I agree that all of this is subjective. To me a 2% reduction in CPU > use (at the cost of ~20% increse in wallclock) & some unclear benefits > to teaching users that they can't rely on our "gzip" output seems > unclear or hypothetical. > > Whereas the widespread breakage reported is very real, where are the reports of widespread berakage outside of GitHub? > and we should > consider GitHub as a canary for that, not the the stand & end of its > potential impact. > > As we didn't have a strong reason to change this in the first place (and > as my series shows, we can have our cake & eat it too if we don't have a > "gzip") I think the obvious choice is to go back to using "gzip".
Am 05.02.23 um 22:30 schrieb Ævar Arnfjörð Bjarmason: > > On Sat, Feb 04 2023, René Scharfe wrote: > >> Am 02.02.23 um 17:25 schrieb Junio C Hamano: >>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >>> >>>> As the disruption of changing the default isn't worth it, let's use >>>> gzip(1) again by default, and only fall back on the new "git archive >>>> gzip" if it isn't available. >>> >>> It perhaps is OK, and lets us answer "ugh, the compressed output of >>> 'git archive' is unstable again" with "we didn't change anything, >>> perhaps you changed your gzip(1)?" when they fix bugs or improve >>> compression or whatever. Of course that is not an overall win for >>> the end users, but in the short term until gzip gets such a change, >>> we would presumably get the "same" output as before. >> >> Restoring the old default is an understandable reflex. In theory it >> worsens consistency and stability of the output, but in practice using >> whatever was found in $PATH did work before -- or at least it was not >> our problem if it didn't. > > "In theory" because the user might be flip-flopping between different > gzip(1) versions? No flopping needed. We can't control what's in $PATH. There are OS-specific replacements for GNU gzip in NetBSD/FreeBSD/macOS and OpenBSD. People could use pigz. Or cat, for that matter. Different versions of different tools might produce different output. There are alternative to the original libz as well, e.g. libz-ng. We don't control which one or which version is installed, either, but we could do so if we wanted by importing one of them like we did with LibXDiff. > Even if it wasn't for that breakage, I think 4/9 and 6/9 here show the > main problem you were trying to solve in making "git archive gzip" the > default didn't need to be solved by changing the default. I.e. the aim > was to have it work when "gzip(1)" wasn't available, which we can do by > falling back only if we can't invoke it, rather than changing the > long-standing default. The aim was to no longer depend on gzip. That goal was already met by providing the internal implementation, without changing the default. Git for Windows for example could use it in their config and drop gzip. Calling gzip if available, warning if it isn't and using the internal implementation adds yet more variance. No longer allowing gzip to be a shell alias might confuse someone. The automatic fallback would only benefit users that don't want to touch /etc/gitconfig, have nobody to do it for them and don't care about warnings -- hopefully not a big crowd. I didn't intend the change of default to be that painful, but don't see the point in going back now that we're through. The new default is better -- one less dependency to care about. And if we need to go back, however, then a know-good state makes more sense than a smart fallback with some new twists. René