diff mbox series

[RFC,1/2] sha1-file: fsync() loose dir entry when core.fsyncObjectFiles

Message ID 20200917112830.26606-2-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series should core.fsyncObjectFiles fsync the dir entry + docs | expand

Commit Message

Ævar Arnfjörð Bjarmason Sept. 17, 2020, 11:28 a.m. UTC
Change the behavior of core.fsyncObjectFiles to also sync the
directory entry. I don't have a case where this broke, just going by
paranoia and the fsync(2) manual page's guarantees about its behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 sha1-file.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

Comments

Jeff King Sept. 17, 2020, 1:16 p.m. UTC | #1
On Thu, Sep 17, 2020 at 01:28:29PM +0200, Ævar Arnfjörð Bjarmason wrote:

> Change the behavior of core.fsyncObjectFiles to also sync the
> directory entry. I don't have a case where this broke, just going by
> paranoia and the fsync(2) manual page's guarantees about its behavior.

I've also often wondered whether this is necessary. Given the symptom of
"oops, this object is there but with 0 bytes" after a hard crash (power
off, etc), my assumption is that the metadata is being journaled but the
actual data is not. Which would imply this isn't needed, but may just be
revealing my naive view of how filesystems work.

And of course all of my experience is on ext4 (which doubly confuses me,
because my systems typically have data=ordered, which I thought would
solve this). Non-journalling filesystems or other modes likely behave
differently, but if this extra fsync carries a cost, we may want to make
it optional.

>  sha1-file.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)

We already fsync pack files, but we don't fsync their directories. If
this is important to do, we should be doing it there, too.

We also don't fsync ref files (nor packed-refs) at all. If fsyncing
files is important for reliability, we should be including those, too.
It may be tempting to say that the important stuff is in objects and the
refs can be salvaged from the commit graph, but my experience says
otherwise. Missing, broken, or mysteriously-rewound refs cause confusing
user-visible behavior, and when compounded with pruning operations like
"git gc" they _do_ result in losing objects.

-Peff
Christoph Hellwig Sept. 17, 2020, 2:09 p.m. UTC | #2
On Thu, Sep 17, 2020 at 01:28:29PM +0200, Ævar Arnfjörð Bjarmason wrote:
> Change the behavior of core.fsyncObjectFiles to also sync the
> directory entry. I don't have a case where this broke, just going by
> paranoia and the fsync(2) manual page's guarantees about its behavior.

It is not just paranoia, but indeed what is required from the standards
POV.  At least for many Linux file systems your second fsync will be
very cheap (basically a NULL syscall) as the log has alredy been forced
all the way by the first one, but you can't rely on that.

Acked-by: Christoph Hellwig <hch@lst.de>
Jeff King Sept. 17, 2020, 2:55 p.m. UTC | #3
On Thu, Sep 17, 2020 at 04:09:12PM +0200, Christoph Hellwig wrote:

> On Thu, Sep 17, 2020 at 01:28:29PM +0200, Ævar Arnfjörð Bjarmason wrote:
> > Change the behavior of core.fsyncObjectFiles to also sync the
> > directory entry. I don't have a case where this broke, just going by
> > paranoia and the fsync(2) manual page's guarantees about its behavior.
> 
> It is not just paranoia, but indeed what is required from the standards
> POV.  At least for many Linux file systems your second fsync will be
> very cheap (basically a NULL syscall) as the log has alredy been forced
> all the way by the first one, but you can't rely on that.

Is it sufficient to fsync() just the surrounding directory? I.e., if I
do:

  mkdir("a");
  mkdir("a/b");
  open("a/b/c", O_WRONLY);

is it enough to fsync() a descriptor pointing to "a/b", or should I
also do "a"?

-Peff
Christoph Hellwig Sept. 17, 2020, 2:56 p.m. UTC | #4
On Thu, Sep 17, 2020 at 10:55:23AM -0400, Jeff King wrote:
> On Thu, Sep 17, 2020 at 04:09:12PM +0200, Christoph Hellwig wrote:
> 
> > On Thu, Sep 17, 2020 at 01:28:29PM +0200, Ævar Arnfjörð Bjarmason wrote:
> > > Change the behavior of core.fsyncObjectFiles to also sync the
> > > directory entry. I don't have a case where this broke, just going by
> > > paranoia and the fsync(2) manual page's guarantees about its behavior.
> > 
> > It is not just paranoia, but indeed what is required from the standards
> > POV.  At least for many Linux file systems your second fsync will be
> > very cheap (basically a NULL syscall) as the log has alredy been forced
> > all the way by the first one, but you can't rely on that.
> 
> Is it sufficient to fsync() just the surrounding directory? I.e., if I
> do:
> 
>   mkdir("a");
>   mkdir("a/b");
>   open("a/b/c", O_WRONLY);
> 
> is it enough to fsync() a descriptor pointing to "a/b", or should I
> also do "a"?

You need to fsync both to be fully compliant, even if just fsyncing b
will work for most but not all file systems.  The good news is that
for those common file systems the extra fsync of a is almost free.
Christoph Hellwig Sept. 17, 2020, 3:09 p.m. UTC | #5
On Thu, Sep 17, 2020 at 09:16:05AM -0400, Jeff King wrote:
> I've also often wondered whether this is necessary. Given the symptom of
> "oops, this object is there but with 0 bytes" after a hard crash (power
> off, etc), my assumption is that the metadata is being journaled but the
> actual data is not. Which would imply this isn't needed, but may just be
> revealing my naive view of how filesystems work.
> 
> And of course all of my experience is on ext4 (which doubly confuses me,
> because my systems typically have data=ordered, which I thought would
> solve this). Non-journalling filesystems or other modes likely behave
> differently, but if this extra fsync carries a cost, we may want to make
> it optional.

I hope my other mail clarified how this works at a high level, if not
feel free to ask more questions.

> >  sha1-file.c | 19 ++++++++++++++-----
> >  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> We already fsync pack files, but we don't fsync their directories. If
> this is important to do, we should be doing it there, too.
> 
> We also don't fsync ref files (nor packed-refs) at all. If fsyncing
> files is important for reliability, we should be including those, too.
> It may be tempting to say that the important stuff is in objects and the
> refs can be salvaged from the commit graph, but my experience says
> otherwise. Missing, broken, or mysteriously-rewound refs cause confusing
> user-visible behavior, and when compounded with pruning operations like
> "git gc" they _do_ result in losing objects.

True, this probably needs to do for the directories of other files
as well.

One interesting optimization under linux is the syncfs syscall, that
syncs all files on a file system - if you need to do a large number
of fsyncs that do not depend on each other for transaction semantics
it can provide a huge speedup.
Junio C Hamano Sept. 17, 2020, 3:37 p.m. UTC | #6
Christoph Hellwig <hch@lst.de> writes:

> On Thu, Sep 17, 2020 at 10:55:23AM -0400, Jeff King wrote:
>> On Thu, Sep 17, 2020 at 04:09:12PM +0200, Christoph Hellwig wrote:
>> 
>> > On Thu, Sep 17, 2020 at 01:28:29PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> > > Change the behavior of core.fsyncObjectFiles to also sync the
>> > > directory entry. I don't have a case where this broke, just going by
>> > > paranoia and the fsync(2) manual page's guarantees about its behavior.
>> > 
>> > It is not just paranoia, but indeed what is required from the standards
>> > POV.  At least for many Linux file systems your second fsync will be
>> > very cheap (basically a NULL syscall) as the log has alredy been forced
>> > all the way by the first one, but you can't rely on that.
>> 
>> Is it sufficient to fsync() just the surrounding directory? I.e., if I
>> do:
>> 
>>   mkdir("a");
>>   mkdir("a/b");
>>   open("a/b/c", O_WRONLY);
>> 
>> is it enough to fsync() a descriptor pointing to "a/b", or should I
>> also do "a"?
>
> You need to fsync both to be fully compliant, even if just fsyncing b
> will work for most but not all file systems.  The good news is that
> for those common file systems the extra fsync of a is almost free.

Back to Ævar's patch, when creating a new loose object, we do these
things:

 1. create temporary file and write the compressed contents to it
    while computing its object name

 2. create the fan-out directory under .git/objects/ if needed

 3. mv temporary file to its final name

and the patch adds open+fsync+close on the fan-out directory.  In
the above exchange with Peff, we learned that open+fsync+close needs
to be done on .git/objects if we created the fan-out directory, too.

Am I reading the above correctly?

Thanks.
Jeff King Sept. 17, 2020, 5:12 p.m. UTC | #7
On Thu, Sep 17, 2020 at 08:37:19AM -0700, Junio C Hamano wrote:

> > You need to fsync both to be fully compliant, even if just fsyncing b
> > will work for most but not all file systems.  The good news is that
> > for those common file systems the extra fsync of a is almost free.
> 
> Back to Ævar's patch, when creating a new loose object, we do these
> things:
> 
>  1. create temporary file and write the compressed contents to it
>     while computing its object name
> 
>  2. create the fan-out directory under .git/objects/ if needed
> 
>  3. mv temporary file to its final name
> 
> and the patch adds open+fsync+close on the fan-out directory.  In
> the above exchange with Peff, we learned that open+fsync+close needs
> to be done on .git/objects if we created the fan-out directory, too.
> 
> Am I reading the above correctly?

That's my understanding. It gets trickier with refs (which I think we
also ought to consider fsyncing), as we may create arbitrarily deep
hierarchies (so we'd have to keep track of which parts got created, or
just conservatively fsync up the whole hierarchy).

It gets a lot easier if we move to reftables that have a more
predictable file/disk structure.

-Peff
Johannes Sixt Sept. 17, 2020, 8:21 p.m. UTC | #8
Am 17.09.20 um 13:28 schrieb Ævar Arnfjörð Bjarmason:
> Change the behavior of core.fsyncObjectFiles to also sync the
> directory entry. I don't have a case where this broke, just going by
> paranoia and the fsync(2) manual page's guarantees about its behavior.
> 
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  sha1-file.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/sha1-file.c b/sha1-file.c
> index dd65bd5c68..d286346921 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -1784,10 +1784,14 @@ int hash_object_file(const struct git_hash_algo *algo, const void *buf,
>  }
>  
>  /* Finalize a file on disk, and close it. */
> -static void close_loose_object(int fd)
> +static void close_loose_object(int fd, const struct strbuf *dirname)
>  {
> -	if (fsync_object_files)
> +	int dirfd;
> +	if (fsync_object_files) {
>  		fsync_or_die(fd, "loose object file");
> +		dirfd = xopen(dirname->buf, O_RDONLY);
> +		fsync_or_die(dirfd, "loose object directory");

Did you have the opportunity to verify that this works on Windows?
Opening a directory with open(2), I mean: It's disallowed according to
the docs:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/open-wopen?view=vs-2019#return-value

> +	}
>  	if (close(fd) != 0)
>  		die_errno(_("error when closing loose object file"));
>  }

-- Hannes
Taylor Blau Sept. 17, 2020, 8:37 p.m. UTC | #9
On Thu, Sep 17, 2020 at 01:12:12PM -0400, Jeff King wrote:
> On Thu, Sep 17, 2020 at 08:37:19AM -0700, Junio C Hamano wrote:
> > Am I reading the above correctly?
>
> That's my understanding. It gets trickier with refs (which I think we
> also ought to consider fsyncing), as we may create arbitrarily deep
> hierarchies (so we'd have to keep track of which parts got created, or
> just conservatively fsync up the whole hierarchy).

Yeah, it definitely gets trickier, but hopefully not by much. I
appreciate Christoph's explanation, and certainly buy into it. I can't
think of any reason why we wouldn't want to apply the same reasoning to
storing refs, too.

It shouldn't be a hold-up for this series, though.

> It gets a lot easier if we move to reftables that have a more
> predictable file/disk structure.

In the sense that we don't have to worry about arbitrary-depth loose
references, yes, but I think we'll still have to deal with both cases.

> -Peff

Thanks,
Taylor
Ævar Arnfjörð Bjarmason Sept. 22, 2020, 8:24 a.m. UTC | #10
On Thu, Sep 17 2020, Johannes Sixt wrote:

> Am 17.09.20 um 13:28 schrieb Ævar Arnfjörð Bjarmason:
>> Change the behavior of core.fsyncObjectFiles to also sync the
>> directory entry. I don't have a case where this broke, just going by
>> paranoia and the fsync(2) manual page's guarantees about its behavior.
>> 
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> ---
>>  sha1-file.c | 19 ++++++++++++++-----
>>  1 file changed, 14 insertions(+), 5 deletions(-)
>> 
>> diff --git a/sha1-file.c b/sha1-file.c
>> index dd65bd5c68..d286346921 100644
>> --- a/sha1-file.c
>> +++ b/sha1-file.c
>> @@ -1784,10 +1784,14 @@ int hash_object_file(const struct git_hash_algo *algo, const void *buf,
>>  }
>>  
>>  /* Finalize a file on disk, and close it. */
>> -static void close_loose_object(int fd)
>> +static void close_loose_object(int fd, const struct strbuf *dirname)
>>  {
>> -	if (fsync_object_files)
>> +	int dirfd;
>> +	if (fsync_object_files) {
>>  		fsync_or_die(fd, "loose object file");
>> +		dirfd = xopen(dirname->buf, O_RDONLY);
>> +		fsync_or_die(dirfd, "loose object directory");
>
> Did you have the opportunity to verify that this works on Windows?
> Opening a directory with open(2), I mean: It's disallowed according to
> the docs:
> https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/open-wopen?view=vs-2019#return-value

I did not, just did a quick hack for an RFC discussion (didn't even
close() that fd), but if I pursue this I'll do it properly.

Doing some research on it now reveals that we should probably have some
Windows-specific code here, e.g. browsing GNUlib's source code reveals
that it uses FlushFileBuffers(), and that code itself is taken from
sqlite. SQLite also has special-case code for some Unix warts,
e.g. OSX's and AIX's special fsync behaviors in its src/os_unix.c

>> +	}
>>  	if (close(fd) != 0)
>>  		die_errno(_("error when closing loose object file"));
>>  }
>
> -- Hannes
Ævar Arnfjörð Bjarmason Sept. 22, 2020, 10:42 a.m. UTC | #11
On Thu, Sep 17 2020, Christoph Hellwig wrote:

> On Thu, Sep 17, 2020 at 01:28:29PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> Change the behavior of core.fsyncObjectFiles to also sync the
>> directory entry. I don't have a case where this broke, just going by
>> paranoia and the fsync(2) manual page's guarantees about its behavior.
>
> It is not just paranoia, but indeed what is required from the standards
> POV.  At least for many Linux file systems your second fsync will be
> very cheap (basically a NULL syscall) as the log has alredy been forced
> all the way by the first one, but you can't rely on that.
>
> Acked-by: Christoph Hellwig <hch@lst.de>

Thanks a lot for your advice in this thread.

Can you (or someone else) suggest a Linux fs setup that's as unforgiving
as possible vis-a-vis fsync() for testing? I'd like to hack on making
git better at this, but one of the problems of testing it is that modern
filesystems generally do a pretty good job of not losing your data.

So something like ext4's commit=N is an obvious start, but for git's own
test suite it would be ideal to have process A write file X, and then
have process B try to read it and just not see it if X hadn't been
fsynced (or not see its directory if that hadn't been synced).

It would turn our test suite into pretty much a 100% failure, but one
that could then be fixed by fixing the relevant file writing code.
Johannes Schindelin Nov. 19, 2020, 11:38 a.m. UTC | #12
Hi Ævar,

On Tue, 22 Sep 2020, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Sep 17 2020, Johannes Sixt wrote:
>
> > Am 17.09.20 um 13:28 schrieb Ævar Arnfjörð Bjarmason:
> >> Change the behavior of core.fsyncObjectFiles to also sync the
> >> directory entry. I don't have a case where this broke, just going by
> >> paranoia and the fsync(2) manual page's guarantees about its behavior.
> >>
> >> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> >> ---
> >>  sha1-file.c | 19 ++++++++++++++-----
> >>  1 file changed, 14 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/sha1-file.c b/sha1-file.c
> >> index dd65bd5c68..d286346921 100644
> >> --- a/sha1-file.c
> >> +++ b/sha1-file.c
> >> @@ -1784,10 +1784,14 @@ int hash_object_file(const struct git_hash_algo *algo, const void *buf,
> >>  }
> >>
> >>  /* Finalize a file on disk, and close it. */
> >> -static void close_loose_object(int fd)
> >> +static void close_loose_object(int fd, const struct strbuf *dirname)
> >>  {
> >> -	if (fsync_object_files)
> >> +	int dirfd;
> >> +	if (fsync_object_files) {
> >>  		fsync_or_die(fd, "loose object file");
> >> +		dirfd = xopen(dirname->buf, O_RDONLY);
> >> +		fsync_or_die(dirfd, "loose object directory");
> >
> > Did you have the opportunity to verify that this works on Windows?
> > Opening a directory with open(2), I mean: It's disallowed according to
> > the docs:
> > https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/open-wopen?view=vs-2019#return-value
>
> I did not, just did a quick hack for an RFC discussion (didn't even
> close() that fd), but if I pursue this I'll do it properly.
>
> Doing some research on it now reveals that we should probably have some
> Windows-specific code here, e.g. browsing GNUlib's source code reveals
> that it uses FlushFileBuffers(), and that code itself is taken from
> sqlite. SQLite also has special-case code for some Unix warts,
> e.g. OSX's and AIX's special fsync behaviors in its src/os_unix.c

If I understand correctly, the idea to `fsync()` directories is to ensure
that metadata updates (such as renames) are flushed, too?

If so (and please note that my understanding of NTFS is not super deep in
this regard), I think that we need not worry on Windows. I have come to
believe that the `rename()` operations are flushed pretty much
immediately, without any `FlushFileBuffers()` (or `_commit()`, as we
actually do in `compat/mingw.h`, to convince yourself see
https://github.com/git/git/blob/v2.29.2/compat/mingw.h#L135-L136).

Directories are not mentioned in `FlushFileBuffers()`'s documentation
(https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-flushfilebuffers)
nor in the documentation of `_commit()`:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/commit?view=msvc-160

Therefore, I believe that there is not even a Win32 equivalent of
`fsync()`ing directories.

Ciao,
Dscho

>
> >> +	} if (close(fd) != 0) die_errno(_("error when closing loose object
> >> file")); }
> >
> > -- Hannes
>
>
>
diff mbox series

Patch

diff --git a/sha1-file.c b/sha1-file.c
index dd65bd5c68..d286346921 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -1784,10 +1784,14 @@  int hash_object_file(const struct git_hash_algo *algo, const void *buf,
 }
 
 /* Finalize a file on disk, and close it. */
-static void close_loose_object(int fd)
+static void close_loose_object(int fd, const struct strbuf *dirname)
 {
-	if (fsync_object_files)
+	int dirfd;
+	if (fsync_object_files) {
 		fsync_or_die(fd, "loose object file");
+		dirfd = xopen(dirname->buf, O_RDONLY);
+		fsync_or_die(dirfd, "loose object directory");
+	}
 	if (close(fd) != 0)
 		die_errno(_("error when closing loose object file"));
 }
@@ -1808,12 +1812,15 @@  static inline int directory_size(const char *filename)
  * We want to avoid cross-directory filename renames, because those
  * can have problems on various filesystems (FAT, NFS, Coda).
  */
-static int create_tmpfile(struct strbuf *tmp, const char *filename)
+static int create_tmpfile(struct strbuf *tmp,
+			  const char *filename,
+			  struct strbuf *dirname)
 {
 	int fd, dirlen = directory_size(filename);
 
 	strbuf_reset(tmp);
 	strbuf_add(tmp, filename, dirlen);
+	strbuf_add(dirname, filename, dirlen);
 	strbuf_addstr(tmp, "tmp_obj_XXXXXX");
 	fd = git_mkstemp_mode(tmp->buf, 0444);
 	if (fd < 0 && dirlen && errno == ENOENT) {
@@ -1848,10 +1855,11 @@  static int write_loose_object(const struct object_id *oid, char *hdr,
 	struct object_id parano_oid;
 	static struct strbuf tmp_file = STRBUF_INIT;
 	static struct strbuf filename = STRBUF_INIT;
+	static struct strbuf dirname = STRBUF_INIT;
 
 	loose_object_path(the_repository, &filename, oid);
 
-	fd = create_tmpfile(&tmp_file, filename.buf);
+	fd = create_tmpfile(&tmp_file, filename.buf, &dirname);
 	if (fd < 0) {
 		if (errno == EACCES)
 			return error(_("insufficient permission for adding an object to repository database %s"), get_object_directory());
@@ -1897,7 +1905,8 @@  static int write_loose_object(const struct object_id *oid, char *hdr,
 		die(_("confused by unstable object source data for %s"),
 		    oid_to_hex(oid));
 
-	close_loose_object(fd);
+	close_loose_object(fd, &dirname);
+	strbuf_release(&dirname);
 
 	if (mtime) {
 		struct utimbuf utb;