Message ID: 20180117184828.31816-1-hch@lst.de (mailing list archive)
State: New, archived
Christoph Hellwig <hch@lst.de> writes:

> fsync is required for data integrity as there is no guarantee that
> data makes it to disk at any specified time without it. Even for
> ext3 with data=ordered mode the file system will only commit all
> data at some point in time that is not guaranteed.

It comes from this one:

    commit aafe9fbaf4f1d1f27a6f6e3eb3e246fff81240ef
    Author: Linus Torvalds <torvalds@linux-foundation.org>
    Date:   Wed Jun 18 15:18:44 2008 -0700

        Add config option to enable 'fsync()' of object files

        As explained in the documentation[*] this is totally useless on
        filesystems that do ordered/journalled data writes, but it can be a
        useful safety feature on filesystems like HFS+ that only journal the
        metadata, not the actual file contents.

        It defaults to off, although we could presumably in theory some day
        auto-enable it on a per-filesystem basis.

        [*] Yes, I updated the docs for the thing. Hell really _has_ frozen
        over, and the four horsemen are probably just beyond the horizon.
        EVERYBODY PANIC!

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 0e25b2c92..9a1cec5c8 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -866,10 +866,8 @@ core.whitespace::
>  core.fsyncObjectFiles::
>  	This boolean will enable 'fsync()' when writing object files.
>  +
> -This is a total waste of time and effort on a filesystem that orders
> -data writes properly, but can be useful for filesystems that do not use
> -journalling (traditional UNIX filesystems) or that only journal metadata
> -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
> +This option is enabled by default and ensures actual data integrity
> +by calling fsync after writing object files.

I am somewhat sympathetic to the desire to flip the default to
"safe" and allow those who know they are already safe to tweak the
knob for performance, and it also makes sense to document that the
default is "true" here. But I do not see the point of removing the
four lines from this paragraph; the sole effect of the removal is to
rob information from readers that they can use to decide if they
want to disable the configuration, no?
On Wed, Jan 17, 2018 at 11:04:32AM -0800, Junio C Hamano wrote:
> Christoph Hellwig <hch@lst.de> writes:
> > @@ -866,10 +866,8 @@ core.whitespace::
> >  core.fsyncObjectFiles::
> >  	This boolean will enable 'fsync()' when writing object files.
> >  +
> > -This is a total waste of time and effort on a filesystem that orders
> > -data writes properly, but can be useful for filesystems that do not use
> > -journalling (traditional UNIX filesystems) or that only journal metadata
> > -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
> > +This option is enabled by default and ensures actual data integrity
> > +by calling fsync after writing object files.
>
> I am somewhat sympathetic to the desire to flip the default to
> "safe" and allow those who know they are already safe to tweak the
> knob for performance, and it also makes sense to document that the
> default is "true" here. But I do not see the point of removing the
> four lines from this paragraph; the sole effect of the removal is to
> rob information from readers that they can use to decide if they
> want to disable the configuration, no?

How about this instead?

    This option is enabled by default and ensures data integrity by calling
    fsync after writing object files. It is not necessary on filesystems
    which journal data writes, but is still necessary on filesystems which
    do not use journalling (ext2), or that only journal metadata writes
    (OS X's HFS+, or Linux's ext4 with "data=writeback"). Turning this
    option off will increase performance at the possible risk of data loss.
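(As an aside for readers following along: the knob under discussion is an
ordinary boolean configuration variable, so the proposed default can be
inspected with `git config --get core.fsyncObjectFiles` and overridden per
repository with `git config core.fsyncObjectFiles false` by those who trust
their filesystem's ordering guarantees.)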
On Wed, Jan 17, 2018 at 11:37:31AM -0800, Matthew Wilcox wrote:
> How about this instead?
>
>     This option is enabled by default and ensures data integrity by calling
>     fsync after writing object files. It is not necessary on filesystems
>     which journal data writes, but is still necessary on filesystems which
>     do not use journalling (ext2), or that only journal metadata writes
>     (OS X's HFS+, or Linux's ext4 with "data=writeback"). Turning this
>     option off will increase performance at the possible risk of data loss.

I think this goes entirely into the wrong direction. The point is
fsync is always the right thing to do. But on ext3 (and ext3 only) the
right thing is way too painful, and conveniently ext3 happens to be
almost ok without it. So if anything should get a special mention it
is ext3.
On Wed, Jan 17, 2018 at 07:48:28PM +0100, Christoph Hellwig wrote:
> fsync is required for data integrity as there is no guarantee that
> data makes it to disk at any specified time without it. Even for
> ext3 with data=ordered mode the file system will only commit all
> data at some point in time that is not guaranteed.
>
> I've lost data on development machines countless times due to the
> lack of this option, and now lost trees on a git server with ext4
> as well yesterday. It's time to make git safe by default.

I'm definitely sympathetic, and I've contemplated a patch like this a
few times. But I'm not sure we're "safe by default" here after this
patch. In particular:

  1. This covers only loose objects. We generally sync pack writes
     already, so we're covered there. But we do not sync ref updates at
     all, which we'd probably want to in a default-safe setup (a common
     post-crash symptom I've seen is zero-length ref files).

  2. Is it sufficient to fsync() the individual file's descriptors?
     We often do other filesystem operations (like hardlinking or
     renaming) that also need to be committed to disk before an
     operation can be considered saved.

  3. Related to (2), we often care about the order of metadata commits.
     E.g., a common sequence is:

       a. Write object contents to tempfile.

       b. rename() or hardlink tempfile to final name.

       c. Write object name into ref.lock file.

       d. rename() ref.lock to ref.

     If we see (d) but not (b), then the result is a corrupted
     repository. Is this guaranteed by ext4 journaling with
     data=ordered?

It may be that data=ordered gets us what we need for (2) and (3). But I
think at the very least we should consider fsyncing ref updates based
on a config option, too.

-Peff
On Wed, Jan 17, 2018 at 03:55:09PM -0500, Jeff King wrote:
> I'm definitely sympathetic, and I've contemplated a patch like this a
> few times. But I'm not sure we're "safe by default" here after this
> patch. In particular:
>
>   1. This covers only loose objects. We generally sync pack writes
>      already, so we're covered there. But we do not sync ref updates at
>      all, which we'd probably want to in a default-safe setup (a common
>      post-crash symptom I've seen is zero-length ref files).

I've not seen them myself yet, but yes, they need an fsync.

>   2. Is it sufficient to fsync() the individual file's descriptors?
>      We often do other filesystem operations (like hardlinking or
>      renaming) that also need to be committed to disk before an
>      operation can be considered saved.

No, for metadata operations we need to fsync the directory as well.

>   3. Related to (2), we often care about the order of metadata commits.
>      E.g., a common sequence is:
>
>        a. Write object contents to tempfile.
>
>        b. rename() or hardlink tempfile to final name.
>
>        c. Write object name into ref.lock file.
>
>        d. rename() ref.lock to ref.
>
>      If we see (d) but not (b), then the result is a corrupted
>      repository. Is this guaranteed by ext4 journaling with
>      data=ordered?

It is not generally guaranteed by Linux file system semantics. Various
file systems will actually start writeback of file data before a
rename, but not actually wait on it.
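To make the directory-fsync point concrete, here is a minimal sketch of the
pattern being described — not git's actual code; durable_rename() is a name
made up for illustration, and error paths are abbreviated:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Sketch only: durably replace "final" with the already-written "tmp".
     * The file data needs fsync(2) on the file descriptor, and the rename
     * itself needs fsync(2) on the containing directory.
     */
    static int durable_rename(const char *tmp, const char *final,
                              const char *dir)
    {
        int fd, dirfd, ret;

        fd = open(tmp, O_RDONLY);
        if (fd < 0)
            return -1;
        ret = fsync(fd);        /* flush file data and inode */
        close(fd);
        if (ret < 0 || rename(tmp, final) < 0)
            return -1;

        dirfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
            return -1;
        ret = fsync(dirfd);     /* commit the new directory entry */
        close(dirfd);
        return ret;
    }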
On Wed, Jan 17 2018, Junio C. Hamano jotted:

> Christoph Hellwig <hch@lst.de> writes:
>
>> fsync is required for data integrity as there is no guarantee that
>> data makes it to disk at any specified time without it. Even for
>> ext3 with data=ordered mode the file system will only commit all
>> data at some point in time that is not guaranteed.
>
> It comes from this one:
>
>     commit aafe9fbaf4f1d1f27a6f6e3eb3e246fff81240ef
>     Author: Linus Torvalds <torvalds@linux-foundation.org>
>     Date:   Wed Jun 18 15:18:44 2008 -0700
>
>         Add config option to enable 'fsync()' of object files
>
>         As explained in the documentation[*] this is totally useless on
>         filesystems that do ordered/journalled data writes, but it can be a
>         useful safety feature on filesystems like HFS+ that only journal the
>         metadata, not the actual file contents.
>
>         It defaults to off, although we could presumably in theory some day
>         auto-enable it on a per-filesystem basis.
>
>         [*] Yes, I updated the docs for the thing. Hell really _has_ frozen
>         over, and the four horsemen are probably just beyond the horizon.
>         EVERYBODY PANIC!
>
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> index 0e25b2c92..9a1cec5c8 100644
>> --- a/Documentation/config.txt
>> +++ b/Documentation/config.txt
>> @@ -866,10 +866,8 @@ core.whitespace::
>>  core.fsyncObjectFiles::
>>  	This boolean will enable 'fsync()' when writing object files.
>>  +
>> -This is a total waste of time and effort on a filesystem that orders
>> -data writes properly, but can be useful for filesystems that do not use
>> -journalling (traditional UNIX filesystems) or that only journal metadata
>> -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
>> +This option is enabled by default and ensures actual data integrity
>> +by calling fsync after writing object files.
>
> I am somewhat sympathetic to the desire to flip the default to
> "safe" and allow those who know they are already safe to tweak the
> knob for performance, and it also makes sense to document that the
> default is "true" here. But I do not see the point of removing the
> four lines from this paragraph; the sole effect of the removal is to
> rob information from readers that they can use to decide if they
> want to disable the configuration, no?

[CC'd the author of the current behavior]

Some points/questions:

a) Is there some reliable way to test whether this is needed from
   userspace? I'm thinking something like `git update-index
   --test-untracked-cache` but for fsync().

b) On the filesystems that don't need this, what's the performance
   impact? I ran a small test myself on CentOS 7 (3.10) with ext4
   data=ordered on the tests I thought might do a lot of loose object
   writes:

   $ GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS="NO_OPENSSL=Y CFLAGS=-O3 -j56" ./run origin/master fsync-on~ fsync-on p3400-rebase.sh p0007-write-cache.sh
   [...]
   Test                                                            fsync-on~         fsync-on
   -------------------------------------------------------------------------------------------------------
   3400.2: rebase on top of a lot of unrelated changes             1.45(1.30+0.17)   1.45(1.28+0.20) +0.0%
   3400.4: rebase a lot of unrelated changes without split-index   4.34(3.71+0.66)   4.33(3.69+0.66) -0.2%
   3400.6: rebase a lot of unrelated changes with split-index      3.38(2.94+0.47)   3.38(2.93+0.47) +0.0%
   0007.2: write_locked_index 3 times (3214 files)                 0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%

   No impact. However I did my own test of running the test suite 10%
   times with/without this patch, and it runs 9% slower:

   fsync-off: avg:21.59 21.50 21.50 21.52 21.53 21.54 21.57 21.59 21.61 21.63 21.95
   fsync-on:  avg:23.43 23.21 23.25 23.26 23.26 23.27 23.32 23.49 23.51 23.83 23.88

   Test script at the end of this E-Mail.

c) What sort of guarantees in this regard do NFS-mounted filesystems
   commonly make?

Test script:

    use v5.10.0;
    use strict;
    use warnings;
    use Time::HiRes qw(time);
    use List::Util qw(sum);
    use Data::Dumper;

    my %time;
    for my $ref (@ARGV) {
        system "git checkout $ref";
        system qq[make -j56 CFLAGS="-O3 -g" NO_OPENSSL=Y all];
        for (1..10) {
            my $t0 = -time();
            system "(cd t && NO_SVN_TESTS=1 GIT_TEST_HTTPD=0 prove -j56 --state=slow,save t[0-9]*.sh)";
            $t0 += time();
            push @{$time{$ref}} => $t0;
        }
    }
    for my $ref (sort keys %time) {
        printf "%20s: avg:%.2f %s\n",
            $ref,
            sum(@{$time{$ref}}) / @{$time{$ref}},
            join(" ", map { sprintf "%.02f", $_ } sort { $a <=> $b } @{$time{$ref}});
    }
On Wed, Jan 17, 2018 at 1:44 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> I ran a small test myself on CentOS 7 (3.10) with ext4 data=ordered
> on the tests I thought might do a lot of loose object writes:
>
> $ GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS="NO_OPENSSL=Y CFLAGS=-O3 -j56" ./run origin/master fsync-on~ fsync-on p3400-rebase.sh p0007-write-cache.sh
> [...]
> Test                                                            fsync-on~         fsync-on
> -------------------------------------------------------------------------------------------------------
> 3400.2: rebase on top of a lot of unrelated changes             1.45(1.30+0.17)   1.45(1.28+0.20) +0.0%
> 3400.4: rebase a lot of unrelated changes without split-index   4.34(3.71+0.66)   4.33(3.69+0.66) -0.2%
> 3400.6: rebase a lot of unrelated changes with split-index      3.38(2.94+0.47)   3.38(2.93+0.47) +0.0%
> 0007.2: write_locked_index 3 times (3214 files)                 0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%
>
> No impact. However I did my own test of running the test suite 10%
> times with/without this patch, and it runs 9% slower:
>
> fsync-off: avg:21.59 21.50 21.50 21.52 21.53 21.54 21.57 21.59 21.61 21.63 21.95
> fsync-on:  avg:23.43 23.21 23.25 23.26 23.26 23.27 23.32 23.49 23.51 23.83 23.88

That's not the thing you should check.

Now re-do the test while another process writes to a totally unrelated
huge file (say, do an ISO file copy or something).

That was the thing that several filesystems get completely and
horribly wrong. Generally _particularly_ the logging filesystems that
don't even need the fsync, because they use a single log for
everything (so fsync serializes all the writes, not just the writes to
the one file it's fsync'ing).

The original git design was very much to write each object file
without any syncing, because they don't matter since a new object file
- by definition - isn't really reachable. Then sync before writing the
index file or a new ref.

But things have changed, and I'm not arguing that the code shouldn't be
made safe by default. I personally refuse to use rotating media on my
machines anyway, largely exactly because of the fsync() issue (things
like "firefox" started doing fsync on the sqlite database for stupid
things, and you'd get huge pauses).

But I do think your benchmark is wrong. The case where only git does
something is not interesting or relevant. It really is "git does
something _and_ somebody else writes something entirely unrelated at
the same time" that matters.

               Linus
On Wed, Jan 17, 2018 at 2:07 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The original git design was very much to write each object file
> without any syncing, because they don't matter since a new object file
> - by definition - isn't really reachable. Then sync before writing the
> index file or a new ref.

.. actually, I think it originally sync'ed only before starting to
actually remove objects due to repacking. The theory was that you can
lose the last commit series or whatever, and have to git-fsck, and
have to re-do it, but you could never lose long-term work. If your
machine crashes, you still remember what you did just before the
crash.

That theory may have been more correct back in the days than it is
now. People who use git might be less willing to treat it like a
filesystem that you can fsck than I was back 10+ years ago.

It's worth noting that the commit that Junio pointed to (from 2008)
didn't actually change any behavior. It just allowed people who cared
to change behavior. The original "let's not waste time on fsync every
object write, because we can just re-create the state anyway" behavior
goes back to 2005.

               Linus
On Wed, Jan 17 2018, Linus Torvalds jotted:

> On Wed, Jan 17, 2018 at 1:44 PM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> I ran a small test myself on CentOS 7 (3.10) with ext4 data=ordered
>> on the tests I thought might do a lot of loose object writes:
>>
>> $ GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS="NO_OPENSSL=Y CFLAGS=-O3 -j56" ./run origin/master fsync-on~ fsync-on p3400-rebase.sh p0007-write-cache.sh
>> [...]
>> No impact. However I did my own test of running the test suite 10%
>> times with/without this patch, and it runs 9% slower:

That should be "10 times" b.t.w., not "10% times".

>> fsync-off: avg:21.59 21.50 21.50 21.52 21.53 21.54 21.57 21.59 21.61 21.63 21.95
>> fsync-on:  avg:23.43 23.21 23.25 23.26 23.26 23.27 23.32 23.49 23.51 23.83 23.88
>
> That's not the thing you should check.
>
> Now re-do the test while another process writes to a totally unrelated
> huge file (say, do an ISO file copy or something).
>
> That was the thing that several filesystems get completely and
> horribly wrong. Generally _particularly_ the logging filesystems that
> don't even need the fsync, because they use a single log for
> everything (so fsync serializes all the writes, not just the writes to
> the one file it's fsync'ing).
>
> The original git design was very much to write each object file
> without any syncing, because they don't matter since a new object file
> - by definition - isn't really reachable. Then sync before writing the
> index file or a new ref.
>
> But things have changed, and I'm not arguing that the code shouldn't be
> made safe by default. I personally refuse to use rotating media on my
> machines anyway, largely exactly because of the fsync() issue (things
> like "firefox" started doing fsync on the sqlite database for stupid
> things, and you'd get huge pauses).
>
> But I do think your benchmark is wrong. The case where only git does
> something is not interesting or relevant. It really is "git does
> something _and_ somebody else writes something entirely unrelated at
> the same time" that matters.

Yeah it's shitty, just a quick hack to get some numbers, since there
was a discussion of performance but neither your original patch nor
this thread had quoted any.

One thing you may have missed is that this is a parallel (56 tests at
a time) run of the full test suite. So there's plenty of other git
processes (and test setup/teardown) racing with any given git process.
Running the test suite in a loop like this gives me ~100K IO ops/s &
~50% disk utilization.

Or does overall FS activity and raw throughput (e.g. with an ISO copy)
matter more than general FS contention?
Tweaking it to emulate this ISO copy case, running another test with
one of these running concurrently:

    # setup
    dd if=/dev/urandom of=/tmp/fake.iso bs=1024 count=$((1000*1024))

    # run in a loop (shuf to not always write the same thing)
    while sleep 0.1; do shuf /tmp/fake.iso | pv >/tmp/fake.shuf.iso; done

Gives throughput that spikes to 100% (not consistently) and:

    fsync-off: avg:36.37 31.74 33.83 35.12 36.19 36.32 37.04 37.34 37.71 37.93 40.43
    fsync-on:  avg:38.09 34.56 35.14 35.69 36.41 36.41 37.96 38.25 40.45 41.44 44.59

That's ~4.7% slower, vs. ~8.5% in my earlier
87h8rki2iu.fsf@evledraar.gmail.com without that running.

Which is not an argument for or against this patch, but those numbers
seem significant, and generally if the entire test suite slows down by
that much there are going to be sub-parts of it that are much worse.
Which might be a reason to tread more carefully, and if it *does* slow
things down, perhaps to do it with more granularity; e.g. turning it on
in git-receive-pack might be more sensible than in git-filter-branch.

I remember Roberto Tyley's BFG writing an amazing amount of loose
objects, but it doesn't seem to have an fsync() option; I wonder if
adding one would be a representative pathological test case.
On Wed, Jan 17, 2018 at 3:16 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> Or does overall FS activity and raw throughput (e.g. with an ISO copy)
> matter more than general FS contention?

Traditionally, yes.

Also note that none of this is about "throughput". It's about waiting
for a second or two when you do a simple "git commit" or similar.

Also, hardware has changed. I haven't used a rotational disk in about
10 years now, and I don't worry about latencies so much any more.

The fact that you get ~100k iops indicates that you probably aren't
using those stupid platters of rust either. So I doubt you can even
trigger the bad cases.

               Linus
On Wed, Jan 17, 2018 at 02:07:22PM -0800, Linus Torvalds wrote:
>
> Now re-do the test while another process writes to a totally unrelated
> huge file (say, do an ISO file copy or something).
>
> That was the thing that several filesystems get completely and
> horribly wrong. Generally _particularly_ the logging filesystems that
> don't even need the fsync, because they use a single log for
> everything (so fsync serializes all the writes, not just the writes to
> the one file it's fsync'ing).

Well, let's be fair; this is something *ext3* got wrong, and it was
the default file system back then. All of the modern file systems now
do delayed allocation, which means that an fsync of one file doesn't
actually imply an fsync of another file. Hence...

> The original git design was very much to write each object file
> without any syncing, because they don't matter since a new object file
> - by definition - isn't really reachable. Then sync before writing the
> index file or a new ref.

This isn't really safe any more. Yes, there's a single log. But
files which are subject to delayed allocation are in the page cache,
and just because you fsync the index file doesn't mean that the object
file is now written to disk. It was true for ext3, but it's not true
for ext4, xfs, btrfs, etc.

The good news is that if you have another process downloading a huge
ISO image, the fsync of the index file won't force the ISO file to be
written out. The bad news is that it won't force out the other git
object files, either.

Now, there is a potential downside of fsync'ing each object file, and
that is the cost of doing a CACHE FLUSH on a HDD is non-trivial, and
even on a SSD, it's not optimal to call CACHE FLUSH thousands of times
in a second. So if you are creating thousands of tiny files, and you
fsync each one, each fsync(2) call is a serializing instruction, which
means it won't return until that one file is written to disk. If you
are writing lots of small files, and you are using a HDD, you'll be
bottlenecked to around 30 files per second on a 5400 RPM HDD, and this
is true regardless of what file system you use, because the bottleneck
is the CACHE FLUSH operation; how you organize the metadata and how
you do the block allocation is largely lost in the noise compared to
the CACHE FLUSH command, which serializes everything.

There are solutions to this; you could simply not call fsync(2) a
thousand times, and instead write a pack file, and call fsync once on
the pack file. That's probably the smartest approach.

You could also create a thousand threads, and call fsync(2) on those
thousand threads at roughly the same time. Or you could use a
bleeding-edge kernel with the latest AIO patch, and use the newly
added IOCB_CMD_FSYNC support.

But I'd simply recommend writing a pack and fsync'ing the pack,
instead of trying to write a gazillion object files. (git-repack -A,
I'm looking at you....)

						- Ted
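A toy illustration of the batching argument above — not git code, with error
handling omitted for brevity: the first loop pays roughly one device cache
flush per file, while the second pays a single flush for the same amount of
data written to one pack-like file:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Toy comparison only: per-file fsync vs. one fsync on one big file. */
    int main(void)
    {
        char name[64], buf[4096];
        int i, fd;

        memset(buf, 'x', sizeof(buf));

        /* Slow path: ~1000 cache flushes, one per "object file". */
        for (i = 0; i < 1000; i++) {
            snprintf(name, sizeof(name), "obj-%04d", i);
            fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
            (void)write(fd, buf, sizeof(buf));
            fsync(fd);          /* serializes on the device's CACHE FLUSH */
            close(fd);
        }

        /* Fast path: the same data in one "pack", one flush at the end. */
        fd = open("pack-all", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        for (i = 0; i < 1000; i++)
            (void)write(fd, buf, sizeof(buf));
        fsync(fd);
        close(fd);
        return 0;
    }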
On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> Well, let's be fair; this is something *ext3* got wrong, and it was
> the default file system back then.

I'm pretty sure reiserfs and btrfs did too..

               Linus
[adding Chris to the Cc list - this is about the awful ext3 data=ordered
behavior of syncing the whole file system data and metadata on each
fsync]

On Wed, Jan 17, 2018 at 03:57:53PM -0800, Linus Torvalds wrote:
> On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> >
> > Well, let's be fair; this is something *ext3* got wrong, and it was
> > the default file system back then.
>
> I'm pretty sure reiserfs and btrfs did too..

I'm pretty sure btrfs never did, and reiserfs at least looks like
it currently doesn't, but I'd have to dig into history to check if
it ever did.
Christoph Hellwig <hch@lst.de> writes:

> [adding Chris to the Cc list - this is about the awful ext3 data=ordered
> behavior of syncing the whole file system data and metadata on each
> fsync]
>
> On Wed, Jan 17, 2018 at 03:57:53PM -0800, Linus Torvalds wrote:
>> On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>> >
>> > Well, let's be fair; this is something *ext3* got wrong, and it was
>> > the default file system back then.
>>
>> I'm pretty sure reiserfs and btrfs did too..
>
> I'm pretty sure btrfs never did, and reiserfs at least looks like
> it currently doesn't, but I'd have to dig into history to check if
> it ever did.

So..., is it fair to say that the one you sent in

    https://public-inbox.org/git/20180117193510.GA30657@lst.de/

is the best variant we have seen in this thread so far? I'll keep
that in my inbox so that I do not forget, but I think we would want
to deal with a hotfix for 2.16 on case insensitive platforms before
this topic.

Thanks.
On Fri, Jan 19, 2018 at 11:08:46AM -0800, Junio C Hamano wrote:
> So..., is it fair to say that the one you sent in
>
>     https://public-inbox.org/git/20180117193510.GA30657@lst.de/
>
> is the best variant we have seen in this thread so far? I'll keep
> that in my inbox so that I do not forget, but I think we would want
> to deal with a hotfix for 2.16 on case insensitive platforms before
> this topic.

It's a simplistic fix, but it will work. There may very well be
certain workloads which generate a large number of loose objects
(e.g., git repack -A) which will make things go significantly more
slowly as a result. It might very well be the case that if nothing
else is going on, something like "write all the files without
fsync(2), then use syncfs(2)" would be much faster. The downside with
that approach is that if indeed you were downloading a multi-gigabyte
DVD image at the same time, the syncfs(2) will force a writeback of
the partially written DVD image, or some other unrelated files.

But if the goal is to just change the default, and then see what
shakes out, and then apply other optimizations later, that's certainly
a valid approach.

I've never been fond of the "git repack -A" behavior where it can
generate huge numbers of loose files. I'd much prefer it if the other
objects ended up in a separate pack file, and then some other
provision were made for nuking that pack file some time later. But
that's expanding the scope significantly over what's currently being
discussed.

						- Ted
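For reference, the "write everything unsynced, then flush the filesystem"
alternative mentioned above would look roughly like the sketch below on
Linux; syncfs(2) is Linux-specific, and flush_objects_fs() is a hypothetical
helper name, not anything in git:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Sketch: after writing many object files with no per-file fsync,
     * force out everything dirty on the filesystem containing the object
     * directory. Note this also writes back unrelated dirty data on the
     * same filesystem (e.g. that partially written DVD image).
     */
    static int flush_objects_fs(const char *objects_dir)
    {
        int ret, fd = open(objects_dir, O_RDONLY | O_DIRECTORY);

        if (fd < 0)
            return -1;
        ret = syncfs(fd);       /* one flush for the whole batch */
        close(fd);
        return ret;
    }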
Theodore Ts'o <tytso@mit.edu> writes:

> .... I've never been fond of the "git repack -A" behavior
> where it can generate huge numbers of loose files. I'd much prefer it
> if the other objects ended up in a separate pack file, and then some
> other provision were made for nuking that pack file some time later....

Yes, a "cruft pack" that holds unreachable objects has been discussed
a few times recently on list, and I do agree that it is a desirable
thing to have in the longer run.

What's tricky is to devise a way to allow us to salvage objects that
are placed in a cruft pack because they are accessed recently, proving
themselves to be no longer cruft. It could be that a good way to
resurrect them is to explode them to loose form when they are accessed
out of a cruft pack. We need to worry about interactions with
read-only users if we go that route, but the current "explode
unreachable to loose, touch their mtime when they are accessed" scheme
ends up ignoring accesses from read-only users that cannot update
mtime anyway, so it might not be too bad.
On 01/18/2018 11:27 AM, Christoph Hellwig wrote:
> [adding Chris to the Cc list - this is about the awful ext3 data=ordered
> behavior of syncing the whole file system data and metadata on each
> fsync]
>
> On Wed, Jan 17, 2018 at 03:57:53PM -0800, Linus Torvalds wrote:
>> On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>>>
>>> Well, let's be fair; this is something *ext3* got wrong, and it was
>>> the default file system back then.
>>
>> I'm pretty sure reiserfs and btrfs did too..
>
> I'm pretty sure btrfs never did, and reiserfs at least looks like
> it currently doesn't, but I'd have to dig into history to check if
> it ever did.

Christoph has this right: btrfs only fsyncs the one file that you've
asked for, and unrelated data/metadata won't be included.

We've seen big fsync stalls on ext4 caused by data=ordered, so it's
still possible to trigger this on ext4, but it's much better than
ext3.

I do share Ted's concern about the perf impact of the fsyncs on tons
of loose files, but the defaults should be safety first.

-chris
On Sat, Jan 20 2018, Junio C. Hamano jotted:

> Theodore Ts'o <tytso@mit.edu> writes:
>
>> .... I've never been fond of the "git repack -A" behavior
>> where it can generate huge numbers of loose files. I'd much prefer it
>> if the other objects ended up in a separate pack file, and then some
>> other provision were made for nuking that pack file some time later....
>
> Yes, a "cruft pack" that holds unreachable objects has been discussed
> a few times recently on list, and I do agree that it is a desirable
> thing to have in the longer run.
>
> What's tricky is to devise a way to allow us to salvage objects that
> are placed in a cruft pack because they are accessed recently, proving
> themselves to be no longer cruft. It could be that a good way to
> resurrect them is to explode them to loose form when they are accessed
> out of a cruft pack. We need to worry about interactions with
> read-only users if we go that route, but the current "explode
> unreachable to loose, touch their mtime when they are accessed" scheme
> ends up ignoring accesses from read-only users that cannot update
> mtime anyway, so it might not be too bad.

Wouldn't it also make gc pruning more expensive? Now you can repack
regularly, and loose objects will be left out of the pack and then
just rm'd, whereas this would entail creating new packs (unless the
whole pack was objects meant for removal).

Probably still worth it, but something to keep in mind.
On Mon, Jan 22, 2018 at 04:09:23PM +0100, Ævar Arnfjörð Bjarmason wrote:
> > What's tricky is to devise a way to allow us to salvage objects that
> > are placed in a cruft pack because they are accessed recently, proving
> > themselves to be no longer cruft. It could be that a good way to
> > resurrect them is to explode them to loose form when they are accessed
> > out of a cruft pack. We need to worry about interactions with
> > read-only users if we go that route, but the current "explode
> > unreachable to loose, touch their mtime when they are accessed" scheme
> > ends up ignoring accesses from read-only users that cannot update
> > mtime anyway, so it might not be too bad.
>
> Wouldn't it also make gc pruning more expensive? Now you can repack
> regularly, and loose objects will be left out of the pack and then
> just rm'd, whereas this would entail creating new packs (unless the
> whole pack was objects meant for removal).

The idea is that the cruft pack would be all objects that were no
longer referenced. Hence the proposal that if they ever *are*
accessed, they would be exploded to a loose object at that point. So
in the common case, the GC would go quickly since the entire pack
could just be rm'ed once it hit the designated expiry time.

Another way of doing things would be to use the mtime of the cruft
pack for the expiry time, and if the cruft pack is ever referenced,
its mtime would get updated. Yet a third way would be to simply clear
the "cruft" bit if it ever *is* referenced. In the common case, it
would never be referenced, so it could just get deleted, but in the
case where the user has manually "rescued" a set of commits (perhaps
by explicitly setting a branch head to a commit id found from the
reflog), the objects would be saved.

So there are many ways it could be managed.

						- Ted
On Mon, Jan 22, 2018 at 04:09:23PM +0100, Ævar Arnfjörð Bjarmason wrote:
> > Yes, a "cruft pack" that holds unreachable objects has been discussed
> > a few times recently on list, and I do agree that it is a desirable
> > thing to have in the longer run.
> >
> > What's tricky is to devise a way to allow us to salvage objects that
> > are placed in a cruft pack because they are accessed recently, proving
> > themselves to be no longer cruft. It could be that a good way to
> > resurrect them is to explode them to loose form when they are accessed
> > out of a cruft pack. We need to worry about interactions with
> > read-only users if we go that route, but the current "explode
> > unreachable to loose, touch their mtime when they are accessed" scheme
> > ends up ignoring accesses from read-only users that cannot update
> > mtime anyway, so it might not be too bad.
>
> Wouldn't it also make gc pruning more expensive? Now you can repack
> regularly, and loose objects will be left out of the pack and then
> just rm'd, whereas this would entail creating new packs (unless the
> whole pack was objects meant for removal).
>
> Probably still worth it, but something to keep in mind.

That's a good point. I think it would be OK in practice, though, since
normal operations don't tend to create a huge number of unreachable
loose objects (at least compared to the _reachable_ loose objects,
which we're already dealing with). We tend to get unbalanced explosions
of loose objects only because a huge chunk of packed history expired.

It is something to keep in mind when implementing the scheme, though.
Luckily we already have the right behavior implemented via
--pack-loose-unreachable (which is used for "repack -k" currently), so
I think it would just be a matter of passing the right flags from
git-repack.

-Peff
On Mon, Jan 22, 2018 at 01:09:03PM -0500, Theodore Ts'o wrote:
> > Wouldn't it also make gc pruning more expensive? Now you can repack
> > regularly, and loose objects will be left out of the pack and then
> > just rm'd, whereas this would entail creating new packs (unless the
> > whole pack was objects meant for removal).
>
> The idea is that the cruft pack would be all objects that were no
> longer referenced. Hence the proposal that if they ever *are*
> accessed, they would be exploded to a loose object at that point. So
> in the common case, the GC would go quickly since the entire pack
> could just be rm'ed once it hit the designated expiry time.

I think Ævar is talking about the case of:

  1. You make 100 objects that aren't referenced. They're loose.

  2. You run git-gc. They're still too recent to be deleted.

Right now those recent loose objects sit loose, and have zero cost at
the time of gc. In a "cruft pack" world, you'd pay some I/O to copy
them into the cruft pack, and some CPU to zlib and delta-compress
them. I think that's probably fine, though.

That said, some of what you wrote left me confused about whether we're
all talking about the same idea. ;) Let me describe the idea I had
mentioned in another thread.

Right now the behavior is basically this: if an unreachable object
becomes referenced, it doesn't immediately get exploded. During the
next gc, whatever new object referenced it would be one of:

  1. Reachable from refs, in which case it carries along the
     formerly-cruft object into the new pack, since it is now also
     reachable.

  2. Unreachable but still recent by mtime; we keep such objects, and
     anything they reference (now as unreachable, in this proposal in
     the cruft pack). Now these get either left loose, or exploded
     loose if they were previously packed.

  3. Unreachable and old. Both objects can be dropped totally.

The current strategy is to use the mtimes for "recent", and we use the
pack's mtime for every object in the pack. So if we pack all the loose
objects into a cruft pack, the mtime of the cruft pack becomes the new
gauge for "recent". And if we migrate objects from old cruft pack to
new cruft pack at each gc, then they'll keep getting their mtimes
refreshed, and we'll never drop them.

So we need to either:

  - keep per-object mtimes, so that old ones can age out (i.e., they'd
    hit case 3 and just not get migrated to either the new "real" pack
    or the new cruft pack).

  - keep multiple cruft packs, and let whole packs age out. But then
    cruft objects which get referenced again by other cruft have to
    get copied (not moved!) to new packs. That _probably_ doesn't
    happen all that often, so it might be OK.

> Another way of doing things would be to use the mtime of the cruft
> pack for the expiry time, and if the cruft pack is ever referenced,
> its mtime would get updated. Yet a third way would be to simply clear
> the "cruft" bit if it ever *is* referenced. In the common case, it
> would never be referenced, so it could just get deleted, but in the
> case where the user has manually "rescued" a set of commits (perhaps
> by explicitly setting a branch head to a commit id found from the
> reflog), the objects would be saved.

I don't think we have to worry about "rescued" objects. Those are
reachable, so they'd get copied into the new "real" pack (and then
their cruft pack eventually deleted).

-Peff
On Mon, Jan 22, 2018 at 07:47:10PM -0500, Jeff King wrote:
>
> I think Ævar is talking about the case of:
>
>   1. You make 100 objects that aren't referenced. They're loose.
>
>   2. You run git-gc. They're still too recent to be deleted.
>
> Right now those recent loose objects sit loose, and have zero cost at
> the time of gc. In a "cruft pack" world, you'd pay some I/O to copy
> them into the cruft pack, and some CPU to zlib and delta-compress
> them. I think that's probably fine, though.

I wasn't assuming that git-gc would create a cruft pack --- although I
guess it could. As you say, recent loose objects have relatively zero
cost at the time of gc. To the extent that the gc has to read lots of
loose files, there may be more seeks in the cold cache case, so there
is actually *some* cost to having the loose objects, but it's not
great.

What I was thinking about instead is that in cases where we know we
are likely to be creating a large number of loose objects (whether
they are referenced or not), in a world where we will be calling
fsync(2) after every single loose object being created, pack files
start looking *way* more efficient. So in general, if you know you
will be creating N loose objects, where N is probably around 50 or so,
you'll want to create a pack instead.

One of those cases is "repack -A", and in that case the loose objects
are all going to be unreferenced, so it would be a "cruft pack". But
there are many other cases where we might be importing from another
DCVS, which is another case where doing an fsync(2) after every loose
object creation (and where I have sometimes seen it create them *all*
loose, and not use a pack at all) is going to get extremely slow and
painful.

> So if we pack all the loose objects into a cruft pack, the mtime of
> the cruft pack becomes the new gauge for "recent". And if we migrate
> objects from old cruft pack to new cruft pack at each gc, then
> they'll keep getting their mtimes refreshed, and we'll never drop
> them.

Well, I was assuming that gc would be a special case which doesn't
update the mtime of the old cruft pack. (Or more generally, any time
an object gets copied out of the cruft pack, either to a loose object
or to another pack, the mtime on the source pack should not be
touched.)

						- Ted
On Tue, Jan 23, 2018 at 12:45:53AM -0500, Theodore Ts'o wrote:
> What I was thinking about instead is that in cases where we know we
> are likely to be creating a large number of loose objects (whether
> they are referenced or not), in a world where we will be calling
> fsync(2) after every single loose object being created, pack files
> start looking *way* more efficient. So in general, if you know you
> will be creating N loose objects, where N is probably around 50 or
> so, you'll want to create a pack instead.
>
> One of those cases is "repack -A", and in that case the loose objects
> are all going to be unreferenced, so it would be a "cruft pack". But
> there are many other cases where we might be importing from another
> DCVS, which is another case where doing an fsync(2) after every loose
> object creation (and where I have sometimes seen it create them *all*
> loose, and not use a pack at all) is going to get extremely slow and
> painful.

Ah, I see. I think in the general case of git operations this is hard
(because most object writes don't realize the larger operation that
they're a part of). But I agree that those two are the low-hanging
fruit (imports should already be using fast-import, and "cruft packs"
are not too hard an idea to implement).

I agree that a cruft-pack implementation could just be for "repack
-A", and does not have to collect otherwise loose objects. I think
part of my confusion was that you and I are coming to the idea from
different angles: you care about minimizing fsyncs, and I'm interested
in stopping the problem where you have too many loose objects after
running auto-gc. So I care more about collecting those loose objects
for that case.

> > So if we pack all the loose objects into a cruft pack, the mtime of
> > the cruft pack becomes the new gauge for "recent". And if we
> > migrate objects from old cruft pack to new cruft pack at each gc,
> > then they'll keep getting their mtimes refreshed, and we'll never
> > drop them.
>
> Well, I was assuming that gc would be a special case which doesn't
> update the mtime of the old cruft pack. (Or more generally, any time
> an object gets copied out of the cruft pack, either to a loose object
> or to another pack, the mtime on the source pack should not be
> touched.)

Right, that's the "you have multiple cruft packs" idea which has been
discussed[1] (each one just hangs around until its mtime expires, and
may duplicate objects found elsewhere). That does end up with one pack
per gc, which just punts the "too many loose objects" problem to "too
many packs". But unless the number of gc runs you do is very high
compared to the expiration time, we can probably ignore that.

-Peff

[1] https://public-inbox.org/git/20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net/
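A rough sketch of the "whole cruft packs age out by mtime" variant being
discussed — everything here is hypothetical (git had no cruft packs at the
time of this thread), and the "cruft-" filename prefix is a stand-in for
whatever marking scheme would actually be chosen:

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    /* Hypothetical: remove cruft packs whose mtime has passed the expiry. */
    static void prune_cruft_packs(const char *pack_dir, time_t expire_secs)
    {
        DIR *dir = opendir(pack_dir);
        struct dirent *de;
        time_t now = time(NULL);

        if (!dir)
            return;
        while ((de = readdir(dir))) {
            char path[4096];
            struct stat st;

            /* the "cruft-" prefix stands in for the real marking scheme */
            if (strncmp(de->d_name, "cruft-", 6))
                continue;
            snprintf(path, sizeof(path), "%s/%s", pack_dir, de->d_name);
            if (stat(path, &st) < 0)
                continue;
            /* any "rescue" access is assumed to have refreshed st_mtime */
            if (now - st.st_mtime > expire_secs)
                unlink(path);   /* the whole pack ages out at once */
        }
        closedir(dir);
    }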
On Thu, Jan 18 2018, Theodore Ts'o wrote:

> [...]
>
> There are solutions to this; you could simply not call fsync(2) a
> thousand times, and instead write a pack file, and call fsync once on
> the pack file. That's probably the smartest approach.
>
> But I'd simply recommend writing a pack and fsync'ing the pack,
> instead of trying to write a gazillion object files. (git-repack -A,
> I'm looking at you....)

[I didn't find an ideal message to reply to in this thread, but this
seemed to probably be the best]

Just an update on this since I went back and looked at this thread:
GitLab about ~1yr ago turned on core.fsyncObjectFiles=true by default.

The reason is detailed in [1]; tl;dr: an empty loose object file issue
on ext4, allegedly caused by the lack of core.fsyncObjectFiles=true,
but I didn't do any root cause analysis. Just noting it here for
future reference.

1. https://gitlab.com/gitlab-org/gitlab-foss/-/issues/51680#note_180508774
On Thu, Sep 17, 2020 at 01:06:50PM +0200, Ævar Arnfjörð Bjarmason wrote:
> The reason is detailed in [1]; tl;dr: an empty loose object file issue
> on ext4, allegedly caused by the lack of core.fsyncObjectFiles=true,
> but I didn't do any root cause analysis. Just noting it here for
> future reference.

All the modern Linux file systems first write the data, and then only
write the metadata after the data write has finished. So your data
might have been partially or fully on disk, but until the transaction
that commits the size change and/or extent state change completes,
you're not going to be able to read it back.
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> [I didn't find an ideal message to reply to in this thread, but this
> seemed to probably be the best]
>
> Just an update on this since I went back and looked at this thread:
> GitLab about ~1yr ago turned on core.fsyncObjectFiles=true by default.
>
> The reason is detailed in [1]; tl;dr: an empty loose object file issue
> on ext4, allegedly caused by the lack of core.fsyncObjectFiles=true,
> but I didn't do any root cause analysis. Just noting it here for
> future reference.
>
> 1. https://gitlab.com/gitlab-org/gitlab-foss/-/issues/51680#note_180508774

Thanks for bringing the original discussion back. I do recall the
discussion, and the list of issues to think about raised by Peff back
then in [2] is still relevant.

2. https://public-inbox.org/git/20180117205509.GA14828@sigill.intra.peff.net/
diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92..9a1cec5c8 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -866,10 +866,8 @@ core.whitespace::
 core.fsyncObjectFiles::
 	This boolean will enable 'fsync()' when writing object files.
 +
-This is a total waste of time and effort on a filesystem that orders
-data writes properly, but can be useful for filesystems that do not use
-journalling (traditional UNIX filesystems) or that only journal metadata
-and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
+This option is enabled by default and ensures actual data integrity
+by calling fsync after writing object files.
 
 core.preloadIndex::
 	Enable parallel index preload for operations like 'git diff'

diff --git a/environment.c b/environment.c
index 63ac38a46..c74375b5e 100644
--- a/environment.c
+++ b/environment.c
@@ -36,7 +36,7 @@ const char *git_hooks_path;
 int zlib_compression_level = Z_BEST_SPEED;
 int core_compression_level;
 int pack_compression_level = Z_DEFAULT_COMPRESSION;
-int fsync_object_files;
+int fsync_object_files = 1;
 size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 size_t delta_base_cache_limit = 96 * 1024 * 1024;
fsync is required for data integrity as there is no guarantee that
data makes it to disk at any specified time without it. Even for
ext3 with data=ordered mode the file system will only commit all
data at some point in time that is not guaranteed.

I've lost data on development machines countless times due to the
lack of this option, and now lost trees on a git server with ext4
as well yesterday. It's time to make git safe by default.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/config.txt | 6 ++----
 environment.c            | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)