Message ID | pull.1439.git.1670433958.gitgitgadget@gmail.com
---|---
Series | Optionally skip hashing index on write
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Writing the index is a critical action that takes place in multiple Git
> commands. The recent performance improvements available with the sparse
> index show how often the I/O costs around the index can affect different Git
> commands, although reading the index takes place more often than a write.

The sparse-index work is great in that it offers correctness while
taking advantage of the knowledge of which part of the tree is
quiescent and unused to boost performance. I am not sure a change
to reduce file safety can be compared with it, in that one is pure
improvement, while the other is a trade-off.

As long as we keep the "create into a new file, write it fully,
fsync, and rename to the final" pattern, we do not need the trailing
checksum to protect us from truncated output due to the index-writing
process dying in the middle, so I do not mind that trade-off, though.

Protecting files from bit-flipping filesystem corruption is a
different matter. Folks at hosting sites like GitHub would know how
often they detect object corruption (I presume they do not have to
deal with the index file on the server end that often, but loose and
pack object files have the trailing checksums the same way) thanks
to the trailing checksum, and what the consequences would be if we
lost that safety (I am guessing they would be minimal, though).

Thanks.
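Junio's point that truncation is already covered rests on the write-to-temporary, fsync, then rename pattern. A minimal Python sketch of that pattern follows; the `atomic_write` helper is illustrative, not Git's actual implementation (Git does this in C via its lockfile API):

```python
import os
import tempfile

def atomic_write(path, data):
    """Illustrative sketch: write `data` to `path` so that readers never
    observe a partially written file, even if the writer dies mid-write."""
    # Create the temporary file in the same directory as the target so
    # the final rename stays within one filesystem and is atomic.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # flush the contents to stable storage
        os.rename(tmp, path)      # atomically replace any previous file
    except BaseException:
        os.unlink(tmp)            # clean up the temporary on failure
        raise
```

If the process dies before the rename, the old file is untouched; the rename either installs the complete new file or nothing. That is why a truncation-detecting trailing checksum is redundant under this scheme, leaving bit-flip corruption as the remaining concern.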
On Thu, Dec 08 2022, Junio C Hamano wrote:

> Protecting files from bit flipping filesystem corruption is a
> different matter. Folks at hosting sites like GitHub would know how
> often they detect object corruption (I presume they do not have to
> deal with the index file on the server end that often, but loose and
> pack object files have the trailing checksums the same way) thanks
> to the trailing checksum, and what the consequences are if we lost
> that safety (I am guessing it would be minimum, though).

I don't think this checksum does much for us in practice, but just on
this point in general: extrapolating results at <hosting site> when it
comes to making general decisions about git's data safety isn't a good
idea.

I don't know about GitHub's hardware, but servers almost universally
use ECC RAM, and tend to use things like error-correcting filesystem
RAID etc.

Data in that area is really interesting when it comes to running git
in that sort of setup, but it really shouldn't be extrapolated to
git's userbase in general. A lot of those users will be using cheap
memory and/or storage devices without any error correction. They're
also likely to stress our reliability guarantees in other ways, e.g.
yanking their power cord (or equivalent), which a server typically
won't need to deal with.
On 12/7/2022 6:27 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> Writing the index is a critical action that takes place in multiple Git
>> commands. The recent performance improvements available with the sparse
>> index show how often the I/O costs around the index can affect different Git
>> commands, although reading the index takes place more often than a write.
>
> The sparse-index work is great in that it offers correctness while
> taking advantage of the knowledge of which part of the tree is
> quiescent and unused to boost performance. I am not sure a change
> to reduce file safety can be compared with it, in that one is pure
> improvement, while the other is trade-off.

I agree that this is a trade-off, and we should be careful about
whether or not we even make this a possibility for certain file
formats. The index is an interesting case for a couple of reasons:

1. Writes block users. Writing the index takes place in many user-
   blocking foreground operations. The speed improvement directly
   impacts their use. Other file formats are typically written in
   the background (commit-graph, multi-pack-index) or are super-
   critical to correctness (pack-files).

2. Index files are short-lived. It is rare that a user leaves an
   index for a long time with many staged changes. That's the
   condition required for losing an index file to cause a loss of
   work (or maybe I'm missing something). Outside of staged changes,
   the index can be completely destroyed and rewritten with minimal
   impact to the user.

> As long as we will keep the "create into a new file, write it fully
> and fsync + rename to the final" pattern, we do not need the trailing
> checksum to protect us from a truncated output due to index-writing
> process dying in the middle, so I do not mind that trade-off, though.
>
> Protecting files from bit flipping filesystem corruption is a
> different matter. Folks at hosting sites like GitHub would know how
> often they detect object corruption (I presume they do not have to
> deal with the index file on the server end that often, but loose and
> pack object files have the trailing checksums the same way) thanks
> to the trailing checksum, and what the consequences are if we lost
> that safety (I am guessing it would be minimum, though).

I agree that we need to be careful about which files get this
treatment. But I also want to point out that I'm not using hosting
servers as evidence that this has worked in practice, but instead many
developer machines in large monorepos that have had this enabled (via
the microsoft/git fork) for years. We've not come across an instance
where this loss of a trailing hash has been an issue.

Thanks,
-Stolee
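The trailing checksum discussed throughout this thread works the same way for the index, loose objects, and packfiles: the last bytes of the file are a hash of everything that precedes them. A rough Python sketch of the idea, using SHA-1's 20-byte digest as the index format does; the helper names here are illustrative, not Git's code:

```python
import hashlib

def add_trailing_hash(payload: bytes) -> bytes:
    # Append the SHA-1 of the file body as the final 20 bytes.
    return payload + hashlib.sha1(payload).digest()

def trailing_hash_ok(data: bytes) -> bool:
    # Recompute the hash over the body and compare it to the stored
    # trailer; a mismatch indicates truncation or bit corruption.
    body, trailer = data[:-20], data[-20:]
    return hashlib.sha1(body).digest() == trailer
```

Skipping the hash on write saves one full pass over the index contents, which is the speedup the series is after; the cost is losing this corruption check the next time the file is read.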
On Thu, Dec 8, 2022 at 8:50 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 12/7/2022 6:27 PM, Junio C Hamano wrote:
> > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> >> Writing the index is a critical action that takes place in multiple Git
> >> commands. The recent performance improvements available with the sparse
> >> index show how often the I/O costs around the index can affect different Git
> >> commands, although reading the index takes place more often than a write.
> >
> > The sparse-index work is great in that it offers correctness while
> > taking advantage of the knowledge of which part of the tree is
> > quiescent and unused to boost performance. I am not sure a change
> > to reduce file safety can be compared with it, in that one is pure
> > improvement, while the other is trade-off.
>
> I agree that this is a trade-off, and we should be careful about
> whether or not we even make this a possibility for certain file
> formats. The index is an interesting case for a couple of reasons:
>
> 1. Writes block users. Writing the index takes place in many user-
>    blocking foreground operations. The speed improvement directly
>    impacts their use. Other file formats are typically written in
>    the background (commit-graph, multi-pack-index) or are super-
>    critical to correctness (pack-files).
>
> 2. Index files are short-lived. It is rare that a user leaves an
>    index for a long time with many staged changes. That's the
>    condition required for losing an index file to cause a loss of
>    work (or maybe I'm missing something). Outside of staged changes,
>    the index can be completely destroyed and rewritten with minimal
>    impact to the user.

Is this information in the commit messages somewhere? I didn't see it
in the cover letter, nor did I see any other explanation there besides
"this makes it faster".

I would expect such a trade-off, or an analysis of "what do we lose",
to be in the cover letter, as it may not be clear otherwise. I do
agree these reasons are good, but it can be confusing for later
reviewers looking back at the code for an option like this and
wondering why it exists.

Thanks,
Jake