
enable core.fsyncObjectFiles by default

Message ID 20180117184828.31816-1-hch@lst.de (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig Jan. 17, 2018, 6:48 p.m. UTC
fsync is required for data integrity as there is no guarantee that
data makes it to disk at any specified time without it.  Even for
ext3 with data=ordered mode the file system will only commit all
data at some point in time that is not guaranteed.

I've lost data on development machines countless times due to the
lack of this option, and yesterday lost trees on a git server with
ext4 as well.  It's time to make git safe by default.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 Documentation/config.txt | 6 ++----
 environment.c            | 2 +-
 2 files changed, 3 insertions(+), 5 deletions(-)

Comments

Junio C Hamano Jan. 17, 2018, 7:04 p.m. UTC | #1
Christoph Hellwig <hch@lst.de> writes:

> fsync is required for data integrity as there is no guarantee that
> data makes it to disk at any specified time without it.  Even for
> ext3 with data=ordered mode the file system will only commit all
> data at some point in time that is not guaranteed.

It comes from this one:

commit aafe9fbaf4f1d1f27a6f6e3eb3e246fff81240ef
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Jun 18 15:18:44 2008 -0700

    Add config option to enable 'fsync()' of object files
    
    As explained in the documentation[*] this is totally useless on
    filesystems that do ordered/journalled data writes, but it can be a
    useful safety feature on filesystems like HFS+ that only journal the
    metadata, not the actual file contents.
    
    It defaults to off, although we could presumably in theory some day
    auto-enable it on a per-filesystem basis.
    
    [*] Yes, I updated the docs for the thing.  Hell really _has_ frozen
        over, and the four horsemen are probably just beyond the horizon.
        EVERYBODY PANIC!
    
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index 0e25b2c92..9a1cec5c8 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -866,10 +866,8 @@ core.whitespace::
>  core.fsyncObjectFiles::
>  	This boolean will enable 'fsync()' when writing object files.
>  +
> -This is a total waste of time and effort on a filesystem that orders
> -data writes properly, but can be useful for filesystems that do not use
> -journalling (traditional UNIX filesystems) or that only journal metadata
> -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
> +This option is enabled by default and ensures actual data integrity
> +by calling fsync after writing object files.

I am somewhat sympathetic to the desire to flip the default to
"safe" and allow those who know they are already safe to tweak the
knob for performance, and it also makes sense to document that the
default is "true" here.  But I do not see the point of removing the
four lines from this paragraph; the sole effect of the removal is to
rob information from readers that they can use to decide if they
want to disable the configuration, no?
Matthew Wilcox Jan. 17, 2018, 7:37 p.m. UTC | #2
On Wed, Jan 17, 2018 at 11:04:32AM -0800, Junio C Hamano wrote:
> Christoph Hellwig <hch@lst.de> writes:
> > @@ -866,10 +866,8 @@ core.whitespace::
> >  core.fsyncObjectFiles::
> >  	This boolean will enable 'fsync()' when writing object files.
> >  +
> > -This is a total waste of time and effort on a filesystem that orders
> > -data writes properly, but can be useful for filesystems that do not use
> > -journalling (traditional UNIX filesystems) or that only journal metadata
> > -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
> > +This option is enabled by default and ensures actual data integrity
> > +by calling fsync after writing object files.
> 
> I am somewhat sympathetic to the desire to flip the default to
> "safe" and allow those who know they are already safe to tweak the
> knob for performance, and it also makes sense to document that the
> default is "true" here.  But I do not see the point of removing the
> four lines from this paragraph; the sole effect of the removal is to
> rob information from readers that they can use to decide if they
> want to disable the configuration, no?

How about this instead?

This option is enabled by default and ensures data integrity by calling
fsync after writing object files.  It is not necessary on filesystems
which journal data writes, but is still necessary on filesystems which
do not use journalling (ext2), or that only journal metadata writes
(OS X's HFS+, or Linux's ext4 with "data=writeback").  Turning this
option off will increase performance at the possible risk of data loss.
Christoph Hellwig Jan. 17, 2018, 7:42 p.m. UTC | #3
On Wed, Jan 17, 2018 at 11:37:31AM -0800, Matthew Wilcox wrote:
> How about this instead?
> 
> This option is enabled by default and ensures data integrity by calling
> fsync after writing object files.  It is not necessary on filesystems
> which journal data writes, but is still necessary on filesystems which
> do not use journalling (ext2), or that only journal metadata writes
> (OS X's HFS+, or Linux's ext4 with "data=writeback").  Turning this
> option off will increase performance at the possible risk of data loss.

I think this goes entirely in the wrong direction.  The point is
fsync is always the right thing to do.  But on ext3 (and ext3 only)
the right thing is way too painful, and conveniently ext3 happens
to be almost ok without it.  So if anything should get a special
mention it is ext3.
Jeff King Jan. 17, 2018, 8:55 p.m. UTC | #4
On Wed, Jan 17, 2018 at 07:48:28PM +0100, Christoph Hellwig wrote:

> fsync is required for data integrity as there is no guarantee that
> data makes it to disk at any specified time without it.  Even for
> ext3 with data=ordered mode the file system will only commit all
> data at some point in time that is not guaranteed.
> 
> I've lost data on development machines countless times due to the
> lack of this option, and yesterday lost trees on a git server with
> ext4 as well.  It's time to make git safe by default.

I'm definitely sympathetic, and I've contemplated a patch like this a
few times. But I'm not sure we're "safe by default" here after this
patch. In particular:

  1. This covers only loose objects. We generally sync pack writes
     already, so we're covered there. But we do not sync ref updates at
     all, which we'd probably want to in a default-safe setup (a common
     post-crash symptom I've seen is zero-length ref files).

  2. Is it sufficient to fsync() the individual file's descriptors?
     We often do other filesystem operations (like hardlinking or
     renaming) that also need to be committed to disk before an
     operation can be considered saved.

  3. Related to (2), we often care about the order of metadata commits.
     E.g., a common sequence is:

       a. Write object contents to tempfile.

       b. rename() or hardlink tempfile to final name.

       c. Write object name into ref.lock file.

       d. rename() ref.lock to ref

     If we see (d) but not (b), then the result is a corrupted
     repository. Is this guaranteed by ext4 journaling with
     data=ordered?

It may be that data=ordered gets us what we need for (2) and (3). But I
think at the very least we should consider fsyncing ref updates based on
a config option, too.

-Peff
Christoph Hellwig Jan. 17, 2018, 9:10 p.m. UTC | #5
On Wed, Jan 17, 2018 at 03:55:09PM -0500, Jeff King wrote:
> I'm definitely sympathetic, and I've contemplated a patch like this a
> few times. But I'm not sure we're "safe by default" here after this
> patch. In particular:
> 
>   1. This covers only loose objects. We generally sync pack writes
>      already, so we're covered there. But we do not sync ref updates at
>      all, which we'd probably want to in a default-safe setup (a common
>      post-crash symptom I've seen is zero-length ref files).

I've not seen them myself yet, but yes, they need an fsync.

>   2. Is it sufficient to fsync() the individual file's descriptors?
>      We often do other filesystem operations (like hardlinking or
>      renaming) that also need to be committed to disk before an
>      operation can be considered saved.

No, for metadata operations we need to fsync the directory as well.

>   3. Related to (2), we often care about the order of metadata commits.
>      E.g., a common sequence is:
> 
>        a. Write object contents to tempfile.
> 
>        b. rename() or hardlink tempfile to final name.
> 
>        c. Write object name into ref.lock file.
> 
>        d. rename() ref.lock to ref
> 
>      If we see (d) but not (b), then the result is a corrupted
>      repository. Is this guaranteed by ext4 journaling with
>      data=ordered?

It is not generally guaranteed by Linux file system semantics.  Various
file systems will actually start writeback of file data before the
rename, but not actually wait on it.
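
To make that concrete, here is a minimal POSIX C sketch (not from the
thread; directory names, file names and payloads are placeholders, and
error and short-write handling are simplified) of the a-d sequence above
with the fsyncs placed where Christoph suggests: flush the file data
before each rename, then flush the containing directory so the rename
itself is durable.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void die(const char *msg)
{
	perror(msg);
	exit(1);
}

/* fsync the directory so the new directory entry from rename() is on disk */
static void fsync_dir(const char *path)
{
	int fd = open(path, O_RDONLY);
	if (fd < 0 || fsync(fd) < 0)
		die(path);
	close(fd);
}

/* write buf to dir/tmp, fsync it, rename to dir/final, fsync the dir */
static void write_durable(const char *dir, const char *tmp,
			  const char *final, const char *buf)
{
	char tmppath[4096], finalpath[4096];
	size_t len = strlen(buf);
	int fd;

	snprintf(tmppath, sizeof(tmppath), "%s/%s", dir, tmp);
	snprintf(finalpath, sizeof(finalpath), "%s/%s", dir, final);

	fd = open(tmppath, O_WRONLY | O_CREAT | O_EXCL, 0644);
	if (fd < 0)
		die(tmppath);
	if (write(fd, buf, len) != (ssize_t)len)
		die("write");
	if (fsync(fd) < 0)	/* file data durable before the rename */
		die("fsync");
	close(fd);
	if (rename(tmppath, finalpath) < 0)
		die("rename");
	fsync_dir(dir);		/* and the rename itself durable, too */
}

int main(void)
{
	/* a + b: object contents must be on disk first ... */
	write_durable("objects", "tmp_obj", "0123abcd", "object payload\n");
	/* ... c + d: only then flip the ref to point at it */
	write_durable("refs", "master.lock", "master", "0123abcd\n");
	return 0;
}

Without the fsync before (b) and the directory fsync after it, nothing
obliges the filesystem to make (b) durable before (d), which is exactly
the corruption case described above.  Git itself goes through its
tempfile/lockfile machinery rather than raw rename(), but the placement
of the flushes is the point of the sketch.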
Ævar Arnfjörð Bjarmason Jan. 17, 2018, 9:44 p.m. UTC | #6
On Wed, Jan 17 2018, Junio C. Hamano jotted:

> Christoph Hellwig <hch@lst.de> writes:
>
>> fsync is required for data integrity as there is no guarantee that
>> data makes it to disk at any specified time without it.  Even for
>> ext3 with data=ordered mode the file system will only commit all
>> data at some point in time that is not guaranteed.
>
> It comes from this one:
>
> commit aafe9fbaf4f1d1f27a6f6e3eb3e246fff81240ef
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Wed Jun 18 15:18:44 2008 -0700
>
>     Add config option to enable 'fsync()' of object files
>
>     As explained in the documentation[*] this is totally useless on
>     filesystems that do ordered/journalled data writes, but it can be a
>     useful safety feature on filesystems like HFS+ that only journal the
>     metadata, not the actual file contents.
>
>     It defaults to off, although we could presumably in theory some day
>     auto-enable it on a per-filesystem basis.
>
>     [*] Yes, I updated the docs for the thing.  Hell really _has_ frozen
>         over, and the four horsemen are probably just beyond the horizon.
>         EVERYBODY PANIC!
>
>> diff --git a/Documentation/config.txt b/Documentation/config.txt
>> index 0e25b2c92..9a1cec5c8 100644
>> --- a/Documentation/config.txt
>> +++ b/Documentation/config.txt
>> @@ -866,10 +866,8 @@ core.whitespace::
>>  core.fsyncObjectFiles::
>>  	This boolean will enable 'fsync()' when writing object files.
>>  +
>> -This is a total waste of time and effort on a filesystem that orders
>> -data writes properly, but can be useful for filesystems that do not use
>> -journalling (traditional UNIX filesystems) or that only journal metadata
>> -and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
>> +This option is enabled by default and ensures actual data integrity
>> +by calling fsync after writing object files.
>
> I am somewhat sympathetic to the desire to flip the default to
> "safe" and allow those who know they are already safe to tweak the
> knob for performance, and it also makes sense to document that the
> default is "true" here.  But I do not see the point of removing the
> four lines from this paragraph; the sole effect of the removal is to
> rob information from readers that they can use to decide if they
> want to disable the configuration, no?

[CC'd the author of the current behavior]

Some points/questions:

 a) Is there some reliable way to test whether this is needed from
    userspace? I'm thinking something like `git update-index
    --test-untracked-cache` but for fsync().

 b) On the filesystems that don't need this, what's the performance
    impact?

    I ran a small test myself on CentOS 7 (3.10) with ext4 data=ordered
    on the tests I thought might do a lot of loose object writes:

      $ GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS="NO_OPENSSL=Y CFLAGS=-O3 -j56" ./run origin/master fsync-on~ fsync-on p3400-rebase.sh p0007-write-cache.sh
      [...]
      Test                                                            fsync-on~         fsync-on
      -------------------------------------------------------------------------------------------------------
      3400.2: rebase on top of a lot of unrelated changes             1.45(1.30+0.17)   1.45(1.28+0.20) +0.0%
      3400.4: rebase a lot of unrelated changes without split-index   4.34(3.71+0.66)   4.33(3.69+0.66) -0.2%
      3400.6: rebase a lot of unrelated changes with split-index      3.38(2.94+0.47)   3.38(2.93+0.47) +0.0%
      0007.2: write_locked_index 3 times (3214 files)                 0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%

   No impact. However I did my own test of running the test suite 10%
   times with/without this patch, and it runs 9% slower:

     fsync-off: avg:21.59 21.50 21.50 21.52 21.53 21.54 21.57 21.59 21.61 21.63 21.95
      fsync-on: avg:23.43 23.21 23.25 23.26 23.26 23.27 23.32 23.49 23.51 23.83 23.88

   Test script at the end of this E-Mail.

 c) What sort of guarantees in this regard do NFS-mounted filesystems
    commonly make?

Test script:

use v5.10.0;
use strict;
use warnings;
use Time::HiRes qw(time);
use List::Util qw(sum);
use Data::Dumper;

my %time;
for my $ref (@ARGV) {
    system "git checkout $ref";
    system qq[make -j56 CFLAGS="-O3 -g" NO_OPENSSL=Y all];
    for (1..10) {
        my $t0 = -time();
        system "(cd t && NO_SVN_TESTS=1 GIT_TEST_HTTPD=0 prove -j56 --state=slow,save t[0-9]*.sh)";
        $t0 += time();
        push @{$time{$ref}} => $t0;
    }
}
for my $ref (sort keys %time) {
    printf "%20s: avg:%.2f %s\n",
        $ref,
        sum(@{$time{$ref}})/@{$time{$ref}},
        join(" ", map { sprintf "%.02f", $_ } sort { $a <=> $b } @{$time{$ref}});
}
Linus Torvalds Jan. 17, 2018, 10:07 p.m. UTC | #7
On Wed, Jan 17, 2018 at 1:44 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>     I ran a small test myself on CentOS 7 (3.10) with ext4 data=ordered
>     on the tests I thought might do a lot of loose object writes:
>
>       $ GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS="NO_OPENSSL=Y CFLAGS=-O3 -j56" ./run origin/master fsync-on~ fsync-on p3400-rebase.sh p0007-write-cache.sh
>       [...]
>       Test                                                            fsync-on~         fsync-on
>       -------------------------------------------------------------------------------------------------------
>       3400.2: rebase on top of a lot of unrelated changes             1.45(1.30+0.17)   1.45(1.28+0.20) +0.0%
>       3400.4: rebase a lot of unrelated changes without split-index   4.34(3.71+0.66)   4.33(3.69+0.66) -0.2%
>       3400.6: rebase a lot of unrelated changes with split-index      3.38(2.94+0.47)   3.38(2.93+0.47) +0.0%
>       0007.2: write_locked_index 3 times (3214 files)                 0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%
>
>    No impact. However I did my own test of running the test suite 10%
>    times with/without this patch, and it runs 9% slower:
>
>      fsync-off: avg:21.59 21.50 21.50 21.52 21.53 21.54 21.57 21.59 21.61 21.63 21.95
>       fsync-on: avg:23.43 23.21 23.25 23.26 23.26 23.27 23.32 23.49 23.51 23.83 23.88

That's not the thing you should check.

Now re-do the test while another process writes to a totally unrelated
huge file (say, do an ISO file copy or something).

That was the thing that several filesystems get completely and
horribly wrong. Generally _particularly_ the logging filesystems that
don't even need the fsync, because they use a single log for
everything (so fsync serializes all the writes, not just the writes to
the one file it's fsync'ing).

The original git design was very much to write each object file
without any syncing, because they don't matter since a new object file
- by definition - isn't really reachable. Then sync before writing the
index file or a new ref.

But things have changed, I'm not arguing that the code shouldn't be
made safe by default. I personally refuse to use rotating media on my
machines anyway, largely exactly because of the fsync() issue (things
like "firefox" started doing fsync on the mysql database for stupid
things, and you'd get huge pauses).

But I do think your benchmark is wrong. The case where only git does
something is not interesting or relevant. It really is "git does
something _and_ somebody else writes something entirely unrelated at
the same time" that matters.

                  Linus
Linus Torvalds Jan. 17, 2018, 10:25 p.m. UTC | #8
On Wed, Jan 17, 2018 at 2:07 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The original git design was very much to write each object file
> without any syncing, because they don't matter since a new object file
> - by definition - isn't really reachable. Then sync before writing the
> index file or a new ref.

.. actually, I think it originally sync'ed only before starting to
actually remove objects due to repacking.

The theory was that you can lose the last commit series or whatever,
and have to git-fsck, and have to re-do it, but you could never lose
long-term work. If your machine crashes, you still remember what you
did just before the crash.

That theory may have been more correct back in the day than it is
now. People who use git might be less willing to treat it like a
filesystem that you can fsck than I was back 10+ years ago.

It's worth noting that the commit that Junio pointed to (from 2008)
didn't actually change any behavior. It just allowed people who cared
to change behavior. The original "let's not waste time on fsync every
object write, because we can just re-create the state anyway" behavior
goes back to 2005.

              Linus
Ævar Arnfjörð Bjarmason Jan. 17, 2018, 11:16 p.m. UTC | #9
On Wed, Jan 17 2018, Linus Torvalds jotted:

> On Wed, Jan 17, 2018 at 1:44 PM, Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>     I ran a small test myself on CentOS 7 (3.10) with ext4 data=ordered
>>     on the tests I thought might do a lot of loose object writes:
>>
>>       $ GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS="NO_OPENSSL=Y CFLAGS=-O3 -j56" ./run origin/master fsync-on~ fsync-on p3400-rebase.sh p0007-write-cache.sh
>>       [...]
>>       Test                                                            fsync-on~         fsync-on
>>       -------------------------------------------------------------------------------------------------------
>>       3400.2: rebase on top of a lot of unrelated changes             1.45(1.30+0.17)   1.45(1.28+0.20) +0.0%
>>       3400.4: rebase a lot of unrelated changes without split-index   4.34(3.71+0.66)   4.33(3.69+0.66) -0.2%
>>       3400.6: rebase a lot of unrelated changes with split-index      3.38(2.94+0.47)   3.38(2.93+0.47) +0.0%
>>       0007.2: write_locked_index 3 times (3214 files)                 0.01(0.00+0.00)   0.01(0.00+0.00) +0.0%
>>
>>    No impact. However I did my own test of running the test suite 10%
>>    times with/without this patch, and it runs 9% slower:

That should be "10 times" b.t.w., not "10% times"

>>
>>      fsync-off: avg:21.59 21.50 21.50 21.52 21.53 21.54 21.57 21.59 21.61 21.63 21.95
>>       fsync-on: avg:23.43 23.21 23.25 23.26 23.26 23.27 23.32 23.49 23.51 23.83 23.88
>
> That's not the thing you should check.
>
> Now re-do the test while another process writes to a totally unrelated
> huge file (say, do an ISO file copy or something).
>
> That was the thing that several filesystems get completely and
> horribly wrong. Generally _particularly_ the logging filesystems that
> don't even need the fsync, because they use a single log for
> everything (so fsync serializes all the writes, not just the writes to
> the one file it's fsync'ing).
>
> The original git design was very much to write each object file
> without any syncing, because they don't matter since a new object file
> - by definition - isn't really reachable. Then sync before writing the
> index file or a new ref.
>
> But things have changed, I'm not arguing that the code shouldn't be
> made safe by default. I personally refuse to use rotating media on my
> machines anyway, largely exactly because of the fsync() issue (things
> like "firefox" started doing fsync on the mysql database for stupid
> things, and you'd get huge pauses).
>
> But I do think your benchmark is wrong. The case where only git does
> something is not interesting or relevant. It really is "git does
> something _and_ somebody else writes something entirely unrelated at
> the same time" that matters.

Yeah it's shitty, just a quick hack to get some numbers since there was
a discussion of performance, but neither your original patch nor this
thread had quoted any.

One thing you may have missed is that this is a parallel (56 tests at a
time) run of the full test suite. So there's plenty of other git
processes (and test setup/teardown) racing with any given git
process. Running the test suite in a loop like this gives me ~100K IO
ops/s & ~50% disk utilization.

Or does overall FS activity and raw throughput (e.g. with an ISO copy)
matter more than general FS contention?

Tweaking it to emulate this iso copy case, running another test with one
of these running concurrently:

    # setup
    dd if=/dev/urandom of=/tmp/fake.iso bs=1024 count=$((1000*1024))
    # run in a loop (shuf to not always write the same thing)
    while sleep 0.1; do shuf /tmp/fake.iso | pv >/tmp/fake.shuf.iso; done

Gives throughput that spikes to 100% (not consistently) and:

    fsync-off: avg:36.37 31.74 33.83 35.12 36.19 36.32 37.04 37.34 37.71 37.93 40.43
     fsync-on: avg:38.09 34.56 35.14 35.69 36.41 36.41 37.96 38.25 40.45 41.44 44.59

~4.7% slower, v.s. ~8.5% in my earlier
87h8rki2iu.fsf@evledraar.gmail.com without that running.

Which is not an argument for / against this patch, but those numbers
seem significant, and generally if the entire test suite slows down by
that much there's going to be sub-parts of it that are much worse.

Which might be a reason to tread more carefully and, if it *does* slow
things down, perhaps do it with more granularity; e.g. turning it on in
git-receive-pack might be more sensible than in git-filter-branch.

I remember Roberto Tyley's BFG writing an amazing amount of loose
objects, but it doesn't seem to have an fsync() option; I wonder if
adding one would make for a representative pathological test case.
Linus Torvalds Jan. 17, 2018, 11:42 p.m. UTC | #10
On Wed, Jan 17, 2018 at 3:16 PM, Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> Or does overall FS activity and raw throughput (e.g. with an ISO copy)
> matter more than general FS contention?

Traditionally, yes.

Also note that none of this is about "throughput". It's about waiting
for a second or two when you do a simple "git commit" or similar.

Also, hardware has changed. I haven't used a rotational disk in about
10 years now, and I don't worry about latencies so much any more.

The fact that you get ~100k iops indicates that you probably aren't
using those stupid platters of rust either. So I doubt you can even
trigger the bad cases.

                Linus
Theodore Ts'o Jan. 17, 2018, 11:52 p.m. UTC | #11
On Wed, Jan 17, 2018 at 02:07:22PM -0800, Linus Torvalds wrote:
> 
> Now re-do the test while another process writes to a totally unrelated
> huge file (say, do an ISO file copy or something).
> 
> That was the thing that several filesystems get completely and
> horribly wrong. Generally _particularly_ the logging filesystems that
> don't even need the fsync, because they use a single log for
> everything (so fsync serializes all the writes, not just the writes to
> the one file it's fsync'ing).

Well, let's be fair; this is something *ext3* got wrong, and it was
the default file system back then.  All of the modern file systems now
do delayed allocation, which means that an fsync of one file doesn't
actually imply an fsync of another file.  Hence...

> The original git design was very much to write each object file
> without any syncing, because they don't matter since a new object file
> - by definition - isn't really reachable. Then sync before writing the
> index file or a new ref.

This isn't really safe any more.  Yes, there's a single log.  But
files which are subject to delayed allocation are in the page cache,
and just because you fsync the index file doesn't mean that the object
file is now written to disk.  It was true for ext3, but it's not true
for ext4, xfs, btrfs, etc.

The good news is that if you have another process downloading a huge
ISO image, the fsync of the index file won't force the ISO file to be
written out.  The bad news is that it won't force out the other git
object files, either.

Now, there is a potential downside of fsync'ing each object file, and
that is the cost of doing a CACHE FLUSH on a HDD is non-trivial, and
even on a SSD, it's not optimal to call CACHE FLUSH thousands of times
in a second.  So if you are creating thousands of tiny files, and you
fsync each one, each fsync(2) call is a serializing instruction, which
means it won't return until that one file is written to disk.  If you
are writing lots of small files, and you are using a HDD, you'll be
bottlenecked to around 30 files per second on a 5400 RPM HDD, and this
is true regardless of what file system you use, because the bottleneck
is the CACHE FLUSH operation, and how you organize the metadata
and how you do the block allocation, is largely lost in the noise
compared to the CACHE FLUSH command, which serializes everything.

There are solutions to this; you could simply not call fsync(2) a
thousand times, and instead write a pack file, and call fsync once on
the pack file.  That's probably the smartest approach.

You could also create a thousand threads, and call fsync(2) on those
thousand threads at roughly the same time.  Or you could use a
bleeding edge kernel with the latest AIO patch, and use the newly
added IOCB_CMD_FSYNC support.

But I'd simply recommend writing a pack and fsync'ing the pack,
instead of trying to write a gazillion object files.  (git-repack -A,
I'm looking at you....)

					- Ted
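
As a concrete illustration of the trade-off Ted describes (a sketch, not
a benchmark; file names, the object count and payloads are placeholders,
and short-write handling is trimmed): the first helper below pays one
serializing flush per object, the second streams the same payloads into
a single pack-like file and flushes once.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NOBJ 1000	/* placeholder object count */

static void die(const char *msg)
{
	perror(msg);
	exit(1);
}

/* one serializing cache flush per object */
static void write_each_with_fsync(void)
{
	char name[64], payload[64];
	int fd, i;

	for (i = 0; i < NOBJ; i++) {
		snprintf(name, sizeof(name), "obj-%04d", i);
		snprintf(payload, sizeof(payload), "object %d\n", i);
		fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0444);
		if (fd < 0)
			die(name);
		if (write(fd, payload, strlen(payload)) < 0)
			die("write");
		if (fsync(fd) < 0)	/* waits for this file to hit disk */
			die("fsync");
		close(fd);
	}
}

/* the same data as one "pack", with a single flush at the end */
static void write_batched(void)
{
	char payload[64];
	int fd, i;

	fd = open("batch.pack", O_WRONLY | O_CREAT | O_TRUNC, 0444);
	if (fd < 0)
		die("batch.pack");
	for (i = 0; i < NOBJ; i++) {
		snprintf(payload, sizeof(payload), "object %d\n", i);
		if (write(fd, payload, strlen(payload)) < 0)
			die("write");
	}
	if (fsync(fd) < 0)	/* one flush for all NOBJ payloads */
		die("fsync");
	close(fd);
}

int main(void)
{
	write_each_with_fsync();
	write_batched();
	return 0;
}

On a rotational disk the first loop is bounded by the cache-flush rate
Ted gives (tens of files per second); the second is bounded mostly by
write bandwidth.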
Linus Torvalds Jan. 17, 2018, 11:57 p.m. UTC | #12
On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>
> Well, let's be fair; this is something *ext3* got wrong, and it was
> the default file system back then.

I'm pretty sure reiserfs and btrfs did too..

          Linus
Christoph Hellwig Jan. 18, 2018, 4:27 p.m. UTC | #13
[adding Chris to the Cc list - this is about the awful ext3 data=ordered
behavior of syncing the whole file system data and metadata on each
fsync]

On Wed, Jan 17, 2018 at 03:57:53PM -0800, Linus Torvalds wrote:
> On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> >
> > Well, let's be fair; this is something *ext3* got wrong, and it was
> > the default file system back then.
> 
> I'm pretty sure reiserfs and btrfs did too..

I'm pretty sure btrfs never did, and reiserfs at least looks like
it currently doesn't but I'd have to dig into history to check if
it ever did.
Junio C Hamano Jan. 19, 2018, 7:08 p.m. UTC | #14
Christoph Hellwig <hch@lst.de> writes:

> [adding Chris to the Cc list - this is about the awful ext3 data=ordered
> behavior of syncing the whole file system data and metadata on each
> fsync]
>
> On Wed, Jan 17, 2018 at 03:57:53PM -0800, Linus Torvalds wrote:
>> On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>> >
>> > Well, let's be fair; this is something *ext3* got wrong, and it was
>> > the default file system back then.
>> 
>> I'm pretty sure reiserfs and btrfs did too..
>
> I'm pretty sure btrfs never did, and reiserfs at least looks like
> it currently doesn't but I'd have to dig into history to check if
> it ever did.

So..., is it fair to say that the one you sent in

  https://public-inbox.org/git/20180117193510.GA30657@lst.de/

is the best variant we have seen in this thread so far?  I'll keep
that in my inbox so that I do not forget, but I think we would want
to deal with a hotfix for 2.16 on case insensitive platforms before
this topic.

Thanks.
Theodore Ts'o Jan. 20, 2018, 10:14 p.m. UTC | #15
On Fri, Jan 19, 2018 at 11:08:46AM -0800, Junio C Hamano wrote:
> So..., is it fair to say that the one you sent in
> 
>   https://public-inbox.org/git/20180117193510.GA30657@lst.de/
> 
> is the best variant we have seen in this thread so far?  I'll keep
> that in my inbox so that I do not forget, but I think we would want
> to deal with a hotfix for 2.16 on case insensitive platforms before
> this topic.

It's a simplistic fix, but it will work.  There may very well be
certain workloads which generate a large number of loose objects
(e.g., git repack -A) which will make things go significantly more
slowly as a result.  It might very well be the case that if nothing
else is going on, something like "write all the files without
fsync(2), then use syncfs(2)" would be much faster.  The downside with
that approach is that if you were indeed downloading a multi-gigabyte DVD
image at the same time, the syncfs(2) would force a writeback of the
partially written DVD image, or of other unrelated files.

But if the goal is to just change the default, and then see what
shakes out, and then apply other optimizations later, that's certainly
a valid result.  I've never been fond of the "git repack -A" behavior
where it can generate huge numbers of loose files.  I'd much prefer it
if the other objects ended up in a separate pack file, and then some
other provision made for nuking that pack file some time later.  But
that's expanding the scope significantly over what's currently being
discussed.

						- Ted
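
For reference, the "write all the files without fsync(2), then use
syncfs(2)" variant is Linux-specific and looks roughly like the sketch
below; any descriptor on the target filesystem will do (the current
directory is used as a placeholder), and as noted above it also forces
out unrelated dirty data on that filesystem.

#define _GNU_SOURCE		/* for syncfs() on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd;

	/* ... write all the loose files here, with no per-file fsync ... */

	fd = open(".", O_RDONLY);	/* any fd on the same filesystem */
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (syncfs(fd) < 0) {		/* flush that whole filesystem once */
		perror("syncfs");
		return 1;
	}
	close(fd);
	return 0;
}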
Junio C Hamano Jan. 20, 2018, 10:27 p.m. UTC | #16
Theodore Ts'o <tytso@mit.edu> writes:

> ....  I've never been fond of the "git repack -A" behavior
> where it can generate huge numbers of loose files.  I'd much prefer it
> if the other objects ended up in a separate pack file, and then some
> other provision made for nuking that pack file some time later....

Yes, a "cruft pack" that holds unreachable objects has been discussed
a few times recently on list, and I do agree that it is a desirable
thing to have in the longer run.

What's tricky is to devise a way to allow us to salvage objects that
are placed in a cruft pack because they are accessed recently,
proving themselves to be no longer cruft.  It could be that a good
way to resurrect them is to explode them to loose form when they are
accessed out of a cruft pack.  We need to worry about interactions
with read-only users if we go that route, but the current
"explode unreachable to loose, touch their mtime when they are
accessed" scheme ends up ignoring accesses from read-only users that
cannot update mtime, so it might not be too bad.
Chris Mason Jan. 21, 2018, 9:32 p.m. UTC | #17
On 01/18/2018 11:27 AM, Christoph Hellwig wrote:
> [adding Chris to the Cc list - this is about the awful ext3 data=ordered
> behavior of syncing the whole file system data and metadata on each
> fsync]
> 
> On Wed, Jan 17, 2018 at 03:57:53PM -0800, Linus Torvalds wrote:
>> On Wed, Jan 17, 2018 at 3:52 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>>>
>>> Well, let's be fair; this is something *ext3* got wrong, and it was
>>> the default file system back then.
>>
>> I'm pretty sure reiserfs and btrfs did too..
> 
> I'm pretty sure btrfs never did, and reiserfs at least looks like
> it currently doesn't but I'd have to dig into history to check if
> it ever did.
> 

Christoph has this right: btrfs only fsyncs the one file that you've
asked for, and unrelated data/metadata won't be included.

We've seen big fsync stalls on ext4 caused by data=ordered, so it's 
still possible to trigger on ext4, but much better than ext3.

I do share Ted's concern about the perf impact of the fsyncs on tons of 
loose files, but the defaults should be safety first.

-chris
Ævar Arnfjörð Bjarmason Jan. 22, 2018, 3:09 p.m. UTC | #18
On Sat, Jan 20 2018, Junio C. Hamano jotted:

> Theodore Ts'o <tytso@mit.edu> writes:
>
>> ....  I've never been fond of the "git repack -A" behavior
>> where it can generate huge numbers of loose files.  I'd much prefer it
>> if the other objects ended up in a separate pack file, and then some
>> other provision made for nuking that pack file some time later....
>
> Yes, a "cruft pack" that holds unreachable objects has been discussed
> a few times recently on list, and I do agree that it is a desirable
> thing to have in the longer run.
>
> What's tricky is to devise a way to allow us to salvage objects that
> are placed in a cruft pack because they are accessed recently,
> proving themselves to be no longer cruft.  It could be that a good
> way to resurrect them is to explode them to loose form when they are
> accessed out of a cruft pack.  We need to worry about interactions
> with read-only users if we go that route, but the current
> "explode unreachable to loose, touch their mtime when they are
> accessed" scheme ends up ignoring accesses from read-only users that
> cannot update mtime, so it might not be too bad.

Wouldn't it also make gc pruning more expensive? Now you can repack
regularly and loose objects will be left out of the pack, and then just
rm'd, whereas now it would entail creating new packs (unless the whole
pack was objects meant for removal).

Probably still worth it, but something to keep in mind.
Theodore Ts'o Jan. 22, 2018, 6:09 p.m. UTC | #19
On Mon, Jan 22, 2018 at 04:09:23PM +0100, Ævar Arnfjörð Bjarmason wrote:
> > What's tricky is to devise a way to allow us to salvage objects that
> > are placed in a cruft pack because they are accessed recently,
> > proving themselves to be no longer crufts.  It could be that a good
> > way to resurrect them is to explode them to loose form when they are
> > accessed out of a cruft pack.  We need to worry about interactions
> > with read-only users if we go that route, but with the current
> > "explode unreachable to loose, touch their mtime when they are
> > accessed" scheme ends up ignoring accesses from read-only users that
> > cannot update mtime, so it might not be too bad.
> 
> Wouldn't it also make gc pruning more expensive? Now you can repack
> regularly and loose objects will be left out of the pack, and then just
> rm'd, whereas now it would entail creating new packs (unless the whole
> pack was objects meant for removal).

The idea is that the cruft pack would be all objects that were no
longer referenced.  Hence the proposal that if they ever *are*
accessed, they would be exploded to a loose object at that point.  So
in the common case, the GC would go quickly since the entire pack
could just be rm'ed once it hit the designated expiry time.

Another way of doing things would be to use the mtime of the cruft
pack for the expiry time, and if the cruft pack is ever referenced,
its mtime would get updated.  Yet a third way would be to simply clear
the "cruft" bit if it ever *is* referenced.  In the common case, it
would never be referenced, so it could just get deleted, but in the
case where the user has manually "rescued" a set of commits (perhaps
by explicitly setting a branch head to commit id found from a reflog),
the objects would be saved.

So there are many ways it could be managed.

							- Ted
Jeff King Jan. 23, 2018, 12:25 a.m. UTC | #20
On Mon, Jan 22, 2018 at 04:09:23PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > Yes, a "cruft pack" that holds unreachable objects has been discussed
> > a few times recently on list, and I do agree that it is a desirable
> > thing to have in the longer run.
> >
> > What's tricky is to devise a way to allow us to salvage objects that
> > are placed in a cruft pack because they are accessed recently,
> > proving themselves to be no longer cruft.  It could be that a good
> > way to resurrect them is to explode them to loose form when they are
> > accessed out of a cruft pack.  We need to worry about interactions
> > with read-only users if we go that route, but the current
> > "explode unreachable to loose, touch their mtime when they are
> > accessed" scheme ends up ignoring accesses from read-only users that
> > cannot update mtime, so it might not be too bad.
> 
> Wouldn't it also make gc pruning more expensive? Now you can repack
> regularly and loose objects will be left out of the pack, and then just
> rm'd, whereas now it would entail creating new packs (unless the whole
> pack was objects meant for removal).
> 
> Probably still worth it, but something to keep in mind.

That's a good point. I think it would be OK in practice, though, since
normal operations don't tend to create a huge number of unreachable
loose objects (at least compared to the _reachable_ loose objects, which
we're already dealing with). We tend to get unbalanced explosions of
loose objects only because a huge chunk of packed history expired.

It is something to keep in mind when implementing the scheme, though.
Luckily we already have the right behavior implemented via
--pack-loose-unreachable (which is used for "repack -k" currently), so I
think it would just be a matter of passing the right flags from
git-repack.

-Peff
Jeff King Jan. 23, 2018, 12:47 a.m. UTC | #21
On Mon, Jan 22, 2018 at 01:09:03PM -0500, Theodore Ts'o wrote:

> > Wouldn't it also make gc pruning more expensive? Now you can repack
> > regularly and loose objects will be left out of the pack, and then just
> > rm'd, whereas now it would entail creating new packs (unless the whole
> > pack was objects meant for removal).
> 
> The idea is that the cruft pack would be all objects that were no
> longer referenced.  Hence the proposal that if they ever *are*
> accessed, they would be exploded to a loose object at that point.  So
> in the common case, the GC would go quickly since the entire pack
> could just be rm'ed once it hit the designated expiry time.

I think Ævar is talking about the case of:

  1. You make 100 objects that aren't referenced. They're loose.

  2. You run git-gc. They're still too recent to be deleted.

Right now those recent loose objects sit loose, and have zero cost at
the time of gc.  In a "cruft pack" world, you'd pay some I/O to copy
them into the cruft pack, and some CPU to zlib and delta-compress them.
I think that's probably fine, though.

That said, some of what you wrote left me confused about whether we're
all talking about the same idea. ;) Let me describe the idea I had
mentioned in another thread.  Right now the behavior is basically this:

If an unreachable object becomes referenced, it doesn't immediately get
exploded. During the next gc, whatever new object referenced it would
be one of:

  1. Reachable from refs, in which case it carries along the
     formerly-cruft object into the new pack, since it is now also
     reachable.

  2. Unreachable but still recent by mtime; we keep such objects, and
     anything they reference (now as unreachable, in this proposal in
     the cruft pack). Now these get either left loose, or exploded loose
     if they were previously packed.

  3. Unreachable and old. Both objects can be dropped totally.

The current strategy is to use the mtimes for "recent", and we use the
pack's mtime for every object in the pack.

So if we pack all the loose objects into a cruft pack, the mtime of the
cruft pack becomes the new gauge for "recent". And if we migrate objects
from old cruft pack to new cruft pack at each gc, then they'll keep
getting their mtimes refreshed, and we'll never drop them.

So we need to either:

  - keep per-object mtimes, so that old ones can age out (i.e., they'd
    hit case 3 and just not get migrated to either the new "real" pack
    or the new cruft pack).

  - keep multiple cruft packs, and let whole packs age out. But then
    cruft objects which get referenced again by other cruft have to get
    copied (not moved!) to new packs. That _probably_ doesn't happen all
    that often, so it might be OK.

> Another way of doing things would be to use the mtime of the cruft
> pack for the expiry time, and if the cruft pack is ever referenced,
> its mtime would get updated.  Yet a third way would be to simply clear
> the "cruft" bit if it ever *is* referenced.  In the common case, it
> would never be referenced, so it could just get deleted, but in the
> case where the user has manually "rescued" a set of commits (perhaps
> by explicitly setting a branch head to commit id found from a reflog),
> the objects would be saved.

I don't think we have to worry about "rescued" objects. Those are
reachable, so they'd get copied into the new "real" pack (and then their
cruft pack eventually deleted).

-Peff
Theodore Ts'o Jan. 23, 2018, 5:45 a.m. UTC | #22
On Mon, Jan 22, 2018 at 07:47:10PM -0500, Jeff King wrote:
> 
> I think Ævar is talking about the case of:
> 
>   1. You make 100 objects that aren't referenced. They're loose.
> 
>   2. You run git-gc. They're still too recent to be deleted.
> 
> Right now those recent loose objects sit loose, and have zero cost at
> the time of gc.  In a "cruft pack" world, you'd pay some I/O to copy
> them into the cruft pack, and some CPU to zlib and delta-compress them.
> I think that's probably fine, though.

I wasn't assuming that git-gc would create a cruft pack --- although I
guess it could.  As you say, recent loose objects have relatively zero
cost at the time of gc.  To the extent that the gc has to read lots of
loose files, there may be more seeks in the cold cache case, so there
is actually *some* cost to having the loose objects, but it's not
great.

What I was thinking about instead is that in cases where we know we
are likely to be creating a large number of loose objects (whether
they referenced or not), in a world where we will be calling fsync(2)
after every single loose object being created, pack files start
looking *way* more efficient.  So in general, if you know you will be
creating N loose objects, where N is probably around 50 or so, you'll
want to create a pack instead.

One of those cases is "repack -A", and in that case the loose objects
are all going to be unreferenced, so it would be a "cruft pack".  But
importing from another DCVS will be another case where doing an fsync(2)
after every loose object creation (and where I have sometimes seen it
create them *all* loose, and not use a pack at all) is going to get
extremely slow and painful.

> So if we pack all the loose objects into a cruft pack, the mtime of the
> cruft pack becomes the new gauge for "recent". And if we migrate objects
> from old cruft pack to new cruft pack at each gc, then they'll keep
> getting their mtimes refreshed, and we'll never drop them.

Well, I was assuming that gc would be a special case which doesn't touch
the mtime of the old cruft pack.  (Or more generally, any time an object
gets copied out of the cruft pack, either to a loose object, or to
another pack, the mtime on the source pack should not be touched.)

	      	  	       	      	   	   - Ted
Jeff King Jan. 23, 2018, 4:17 p.m. UTC | #23
On Tue, Jan 23, 2018 at 12:45:53AM -0500, Theodore Ts'o wrote:

> What I was thinking about instead is that in cases where we know we
> are likely to be creating a large number of loose objects (whether
> they are referenced or not), in a world where we will be calling fsync(2)
> after every single loose object is created, pack files start
> looking *way* more efficient.  So in general, if you know you will be
> creating N loose objects, where N is probably around 50 or so, you'll
> want to create a pack instead.
> 
> One of those cases is "repack -A", and in that case the loose objects
> are all going to be unreferenced, so it would be a "cruft pack".  But
> importing from another DCVS will be another case where doing an fsync(2)
> after every loose object creation (and where I have sometimes seen it
> create them *all* loose, and not use a pack at all) is going to get
> extremely slow and painful.

Ah, I see. I think in the general case of git operations this is hard
(because most object writes don't realize the larger operation that
they're a part of). But I agree that those two are the low-hanging fruit
(imports should already be using fast-import, and "cruft packs" are not
too hard an idea to implement).

I agree that a cruft-pack implementation could just be for "repack -A",
and does not have to collect otherwise loose objects. I think part of my
confusion was that you and I are coming to the idea from different
angles: you care about minimizing fsyncs, and I'm interested in stopping
the problem where you have too many loose objects after running auto-gc.
So I care more about collecting those loose objects for that case.

> > So if we pack all the loose objects into a cruft pack, the mtime of the
> > cruft pack becomes the new gauge for "recent". And if we migrate objects
> > from old cruft pack to new cruft pack at each gc, then they'll keep
> > getting their mtimes refreshed, and we'll never drop them.
> 
> Well, I was assuming that gc would be a special case which doesn't touch
> the mtime of the old cruft pack.  (Or more generally, any time an object
> gets copied out of the cruft pack, either to a loose object, or to
> another pack, the mtime on the source pack should not be touched.)

Right, that's the "you have multiple cruft packs" idea which has been
discussed[1] (each one just hangs around until its mtime expires, and
may duplicate objects found elsewhere).

That does end up with one pack per gc, which just punts the "too many
loose objects" to "too many packs". But unless the number of gc runs you
do is very high compared to the expiration time, we can probably ignore
that.

-Peff

[1] https://public-inbox.org/git/20170610080626.sjujpmgkli4muh7h@sigill.intra.peff.net/
Ævar Arnfjörð Bjarmason Sept. 17, 2020, 11:06 a.m. UTC | #24
On Thu, Jan 18 2018, Theodore Ts'o wrote:

> On Wed, Jan 17, 2018 at 02:07:22PM -0800, Linus Torvalds wrote:
>> 
>> Now re-do the test while another process writes to a totally unrelated
>> huge file (say, do an ISO file copy or something).
>> 
>> That was the thing that several filesystems get completely and
>> horribly wrong. Generally _particularly_ the logging filesystems that
>> don't even need the fsync, because they use a single log for
>> everything (so fsync serializes all the writes, not just the writes to
>> the one file it's fsync'ing).
>
> Well, let's be fair; this is something *ext3* got wrong, and it was
> the default file system back then.  All of the modern file systems now
> do delayed allocation, which means that an fsync of one file doesn't
> actually imply an fsync of another file.  Hence...
>
>> The original git design was very much to write each object file
>> without any syncing, because they don't matter since a new object file
>> - by definition - isn't really reachable. Then sync before writing the
>> index file or a new ref.
>
> This isn't really safe any more.  Yes, there's a single log.  But
> files which are subject to delayed allocation are in the page cache,
> and just because you fsync the index file doesn't mean that the object
> file is now written to disk.  It was true for ext3, but it's not true
> for ext4, xfs, btrfs, etc.
>
> The good news is that if you have another process downloading a huge
> ISO image, the fsync of the index file won't force the ISO file to be
> written out.  The bad news is that it won't force out the other git
> object files, either.
>
> Now, there is a potential downside of fsync'ing each object file, and
> that is the cost of doing a CACHE FLUSH on a HDD is non-trivial, and
> even on a SSD, it's not optimal to call CACHE FLUSH thousands of times
> in a second.  So if you are creating thousands of tiny files, and you
> fsync each one, each fsync(2) call is a serializing instruction, which
> means it won't return until that one file is written to disk.  If you
> are writing lots of small files, and you are using a HDD, you'll be
> bottlenecked to around 30 files per second on a 5400 RPM HDD, and this
> is true regardless of what file system you use, because the bottleneck
> is the CACHE FLUSH operation, and how you organize the metadata
> and how you do the block allocation, is largely lost in the noise
> compared to the CACHE FLUSH command, which serializes everything.
>
> There are solutions to this; you could simply not call fsync(2) a
> thousand times, and instead write a pack file, and call fsync once on
> the pack file.  That's probably the smartest approach.
>
> You could also create a thousand threads, and call fsync(2) on those
> thousand threads at roughly the same time.  Or you could use a
> bleeding edge kernel with the latest AIO patch, and use the newly
> added IOCB_CMD_FSYNC support.
>
> But I'd simply recommend writing a pack and fsync'ing the pack,
> instead of trying to write a gazillion object files.  (git-repack -A,
> I'm looking at you....)
>
> 					- Ted

[I didn't find an ideal message to reply to in this thread, but this
seemed to probably be the best]

Just an update on this since I went back and looked at this thread,
GitLab about ~1yr ago turned on core.fsyncObjectFiles=true by
default.

The reason is detailed in [1], tl;dr: empty loose object file issue on
ext4 allegedly caused by a lack of core.fsyncObjectFiles=true, but I
didn't do any root cause analysis. Just noting it here for future
reference.

1. https://gitlab.com/gitlab-org/gitlab-foss/-/issues/51680#note_180508774
Christoph Hellwig Sept. 17, 2020, 2:14 p.m. UTC | #25
On Thu, Sep 17, 2020 at 01:06:50PM +0200, Ævar Arnfjörð Bjarmason wrote:
> The reason is detailed in [1], tl;dr: empty loose object file issue on
> ext4 allegedly caused by a lack of core.fsyncObjectFiles=true, but I
> didn't do any root cause analysis. Just noting it here for future
> reference.

All the modern Linux file systems first write the data, and then only
write the metadata after the data write has finished.  So your data might
have been partially or fully on disk, but until the transaction to commit
the size change and/or extent state change completes, you're not going to
be able to read it back.
Junio C Hamano Sept. 17, 2020, 3:30 p.m. UTC | #26
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> [I didn't find an ideal message to reply to in this thread, but this
> seemed to probably be the best]
>
> Just an update on this since I went back and looked at this thread,
> GitLab about ~1yr ago turned on core.fsyncObjectFiles=true by
> default.
>
> The reason is detailed in [1], tl;dr: empty loose object file issue on
> ext4 allegedly caused by a lack of core.fsyncObjectFiles=true, but I
> didn't do any root cause analysis. Just noting it here for future
> reference.
>
> 1. https://gitlab.com/gitlab-org/gitlab-foss/-/issues/51680#note_180508774

Thanks for bringing the original discussion back.  I do recall the
discussion, and the list of issues Peff raised back then in [2] is
still relevant.


2. https://public-inbox.org/git/20180117205509.GA14828@sigill.intra.peff.net/

Patch

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92..9a1cec5c8 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -866,10 +866,8 @@  core.whitespace::
 core.fsyncObjectFiles::
 	This boolean will enable 'fsync()' when writing object files.
 +
-This is a total waste of time and effort on a filesystem that orders
-data writes properly, but can be useful for filesystems that do not use
-journalling (traditional UNIX filesystems) or that only journal metadata
-and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").
+This option is enabled by default and ensures actual data integrity
+by calling fsync after writing object files.
 
 core.preloadIndex::
 	Enable parallel index preload for operations like 'git diff'
diff --git a/environment.c b/environment.c
index 63ac38a46..c74375b5e 100644
--- a/environment.c
+++ b/environment.c
@@ -36,7 +36,7 @@  const char *git_hooks_path;
 int zlib_compression_level = Z_BEST_SPEED;
 int core_compression_level;
 int pack_compression_level = Z_DEFAULT_COMPRESSION;
-int fsync_object_files;
+int fsync_object_files = 1;
 size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 size_t delta_base_cache_limit = 96 * 1024 * 1024;
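
For context on what the one-line environment.c change above controls:
the flag is consulted where a loose object file is finalized.  The sketch
below shows that kind of call site in a self-contained form; the function
and message names are illustrative, not git's actual internals.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* stands in for git's global, now defaulting to 1 after this patch */
static int fsync_object_files = 1;

static void die_errno(const char *msg)
{
	perror(msg);
	exit(128);
}

/* the kind of call site the config flag gates (names illustrative) */
static void close_loose_object(int fd)
{
	if (fsync_object_files && fsync(fd) < 0)
		die_errno("unable to fsync loose object file");
	if (close(fd) != 0)
		die_errno("unable to close loose object file");
}

int main(void)
{
	const char *payload = "compressed object bytes would go here\n";
	int fd = open("example-object", O_WRONLY | O_CREAT | O_EXCL, 0444);

	if (fd < 0)
		die_errno("open");
	if (write(fd, payload, strlen(payload)) < 0)
		die_errno("write");
	close_loose_object(fd);
	return 0;
}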