diff mbox series

[v2,2/7] core.fsyncmethod: batched disk flushes for loose-objects

Message ID 3ed1dcd9b9ba9b34f26b3012eaba8da0269ee842.1647760560.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects | expand

Commit Message

Neeraj Singh (WINDOWS-SFS) March 20, 2022, 7:15 a.m. UTC
From: Neeraj Singh <neerajsi@microsoft.com>

When adding many objects to a repo with `core.fsync=loose-object`,
the cost of fsync'ing each object file can become prohibitive.

One major source of the cost of fsync is the implied flush of the
hardware writeback cache within the disk drive. This commit introduces
a new `core.fsyncMethod=batch` option that batches up hardware flushes.
It hooks into the bulk-checkin plugging and unplugging functionality,
takes advantage of tmp-objdir, and uses the writeout-only support code.

When the new mode is enabled, we do the following for each new object:
1. Create the object in a tmp-objdir.
2. Issue a pagecache writeback request and wait for it to complete.

At the end of the entire transaction when unplugging bulk checkin:
1. Issue an fsync against a dummy file to flush the hardware writeback
   cache, which should by now have seen the tmp-objdir writes.
2. Rename all of the tmp-objdir files to their final names.
3. When updating the index and/or refs, we assume that Git will issue
   another fsync internal to that operation. This is not the default
   today, but the user now has the option of syncing the index and there
   is a separate patch series to implement syncing of refs.
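A rough sketch of the two phases in C (illustrative only: write_one_object() and flush_batch() are not the actual bulk-checkin functions, and sync_file_range() is the Linux spelling of the writeout-only request; other platforms use different primitives):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Per-object step: write into the tmp-objdir, then ask the kernel to
 * write the pages out without flushing the disk writeback cache. */
static int write_one_object(const char *tmp_path, const void *buf, size_t len)
{
	int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0444);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	/* Writeout-only: data reaches the drive, but no cache-flush command. */
	if (sync_file_range(fd, 0, 0,
			    SYNC_FILE_RANGE_WAIT_BEFORE |
			    SYNC_FILE_RANGE_WRITE |
			    SYNC_FILE_RANGE_WAIT_AFTER) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/* Unplug step: a single full fsync on a dummy file flushes the hardware
 * writeback cache for everything written above; after this, the
 * tmp-objdir files can safely be renamed to their final names. */
static int flush_batch(const char *cookie_path)
{
	int fd = open(cookie_path, O_RDWR | O_CREAT | O_EXCL, 0600);
	int ret;

	if (fd < 0)
		return -1;
	ret = fsync(fd);
	close(fd);
	unlink(cookie_path);
	return ret;
}
```

The point of the split is that the expensive cache flush is paid once per batch instead of once per object.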

On a filesystem with a single journal that is updated during name
operations (e.g. create, link, rename), such as NTFS, HFS+, or XFS,
we would expect the fsync to trigger a journal writeout, so this
sequence is enough to ensure that the user's data is durable by the
time the git command returns.

Batch mode is only enabled if core.fsyncObjectFiles is false or unset.

_Performance numbers_:

Linux - Hyper-V VM running Kernel 5.11 (Ubuntu 20.04) on a fast SSD.
Mac - macOS 11.5.1 running on a Mac mini on a 1TB Apple SSD.
Windows - Same host as Linux, a preview version of Windows 11.

Adding 500 files to the repo with 'git add'.  Times reported in seconds.

object file syncing | Linux | Mac   | Windows
--------------------|-------|-------|--------
           disabled | 0.06  |  0.35 | 0.61
              fsync | 1.88  | 11.18 | 2.47
              batch | 0.15  |  0.41 | 1.53

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
---
 Documentation/config/core.txt |  7 ++++
 bulk-checkin.c                | 70 +++++++++++++++++++++++++++++++++++
 bulk-checkin.h                |  2 +
 cache.h                       |  8 +++-
 config.c                      |  2 +
 object-file.c                 |  2 +
 6 files changed, 90 insertions(+), 1 deletion(-)

Comments

Ævar Arnfjörð Bjarmason March 21, 2022, 2:41 p.m. UTC | #1
On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
> [...]
> +	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> +		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> +		if (!bulk_fsync_objdir)
> +			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));

Should camel-case the config var, and we should have a die_errno() here
which tells us why we couldn't create it (possibly needing to ferry it up
from the tempfile API...)
Ævar Arnfjörð Bjarmason March 21, 2022, 3:47 p.m. UTC | #2
On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> One major source of the cost of fsync is the implied flush of the
> hardware writeback cache within the disk drive. This commit introduces
> a new `core.fsyncMethod=batch` option that batches up hardware flushes.
> It hooks into the bulk-checkin plugging and unplugging functionality,
> takes advantage of tmp-objdir, and uses the writeout-only support code.
>
> When the new mode is enabled, we do the following for each new object:
> 1. Create the object in a tmp-objdir.
> 2. Issue a pagecache writeback request and wait for it to complete.
>
> At the end of the entire transaction when unplugging bulk checkin:
> 1. Issue an fsync against a dummy file to flush the hardware writeback
>    cache, which should by now have seen the tmp-objdir writes.
> 2. Rename all of the tmp-objdir files to their final names.
> 3. When updating the index and/or refs, we assume that Git will issue
>    another fsync internal to that operation. This is not the default
>    today, but the user now has the option of syncing the index and there
>    is a separate patch series to implement syncing of refs.

Re my question in
https://lore.kernel.org/git/220310.86r179ki38.gmgdl@evledraar.gmail.com/
(which you *partially* replied to per my reading, i.e. not the
fsync_nth() question) I still don't get why the tmp-objdir part of this
is needed.

For "git stash" which is one thing sped up by this let's go over what
commands/FS ops we do. I changed the test like this:
	
	diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
	index 3fc16944e9e..479a495c68c 100755
	--- a/t/t3903-stash.sh
	+++ b/t/t3903-stash.sh
	@@ -1383,7 +1383,7 @@ BATCH_CONFIGURATION='-c core.fsync=loose-object -c core.fsyncmethod=batch'
	 
	 test_expect_success 'stash with core.fsyncmethod=batch' "
	 	test_create_unique_files 2 4 fsync-files &&
	-	git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ &&
	+	strace -f git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ &&
	 	rm -f fsynced_files &&
	 
	 	# The files were untracked, so use the third parent,
	
Then we get this output, with my comments, and I snipped some output:
	 
	$ ./t3903-stash.sh --run=1-4,114 -vixd 2>&1|grep --color -e 89772c935031c228ed67890f9 -e .git/stash -e bulk_fsync -e .git/index
	[pid 14703] access(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
	[pid 14703] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
	[pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/tmp_obj_bdUlzu", ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0

Here we're creating the tmp_objdir() files. We then sync_file_range()
and close() this.

	[pid 14703] openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7", O_RDWR|O_CREAT|O_EXCL, 0600) = 9
	[pid 14703] unlink("/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7") = 0

This is the flushing of the "cookie" in do_batch_fsync().

	[pid 14703] newfstatat(AT_FDCWD, ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, 0) = 0
	[pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0

Here we're going through the object dir migration with
unplug_bulk_checkin().

	[pid 14703] unlink(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0
	newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0
	[pid 14705] access(".git/objects/tmp_objdir-bulk-fsync-0F7DGy/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
	[pid 14705] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = 0
	[pid 14705] utimensat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", NULL, 0) = 0
	[pid 14707] openat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", O_RDONLY|O_CLOEXEC) = 9

We then update the index itself, first a temporary index.stash :

    openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 8
    openat(AT_FDCWD, ".git/index.stash.19141", O_RDONLY) = 9
    newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0
    newfstatat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", {st_mode=S_IFREG|0644, st_size=927, ...}, 0) = 0
    rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141") = 0
    unlink(".git/index.stash.19141")        = 0

Followed by the same and a later rename of the actual index:

    [pid 19146] rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index") = 0

So, my question is still why the temporary object dir migration part of
this is needed.

We are writing N loose object files, and we write those to temporary
names already.

AFAICT we could do all of this by doing the same
tmp/rename/sync_file_range dance on the main object store.

Then instead of the "bulk_fsync" cookie file don't close() the last file
object file we write until we issue the fsync on it.

But maybe this is all needed, I just can't understand from the commit
message why the "bulk checkin" part is being done.

I think since we've been over this a few times without any success it
would really help to have some example of the smallest set of syscalls
to write a file like this safely. I.e. this is doing (pseudocode):

    /* first the bulk path */
    open("bulk/x.tmp");
    write("bulk/x.tmp");
    sync_file_range("bulk/x.tmp");
    close("bulk/x.tmp");
    rename("bulk/x.tmp", "bulk/x");
    open("bulk/y.tmp");
    write("bulk/y.tmp");
    sync_file_range("bulk/y.tmp");
    close("bulk/y.tmp");
    rename("bulk/y.tmp", "bulk/y");
    /* Rename to "real" */
    rename("bulk/x", "x");
    rename("bulk/y", "y");
    /* sync a cookie */
    fsync("cookie");

And I'm asking why it's not:

    /* Rename to "real" as we go */
    open("x.tmp");
    write("x.tmp");
    sync_file_range("x.tmp");
    close("x.tmp");
    rename("x.tmp", "x");
    last_fd = open("y.tmp"); /* don't close() the last one yet */
    write("y.tmp");
    sync_file_range("y.tmp");
    rename("y.tmp", "y");
    /* sync a cookie */
    fsync(last_fd);

Which I guess is two questions:

 A. do we need the cookie, or can we re-use the fd of the last thing we
    write?
 B. Is the bulk indirection needed?

> +		fsync_or_die(fd, "loose object file");

Unrelated nit: this API is producing sentence lego unfriendly to
translators.

Should be made to take an enum or something, so we can emit the relevant
translated message in fsync_or_die(). Imagine getting:

	fsync error on '日本語は話せません'

Which this will do, just the other way around for non-English speakers
using the translation.

(The solution is also not to add _() here, since translators will want
to control the word order.)
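A sketch of the enum approach (names hypothetical, not Git's actual API), so each call site selects a whole translatable sentence instead of composing a fragment:

```c
#include <string.h>

/* Hypothetical replacement for the string parameter of fsync_or_die():
 * each enum value maps to a complete sentence that a translator can
 * reorder freely (the strings would be wrapped in _() in real code). */
enum fsync_action {
	FSYNC_LOOSE_OBJECT_FILE,
	FSYNC_PACK_FILE,
};

static const char *fsync_error_message(enum fsync_action action)
{
	switch (action) {
	case FSYNC_LOOSE_OBJECT_FILE:
		return "fsync error on loose object file";
	case FSYNC_PACK_FILE:
		return "fsync error on pack file";
	}
	return "fsync error";
}
```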

> diff --git a/cache.h b/cache.h
> index 3160bc1e489..d1ae51388c9 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1040,7 +1040,8 @@ extern int use_fsync;
>  
>  enum fsync_method {
>  	FSYNC_METHOD_FSYNC,
> -	FSYNC_METHOD_WRITEOUT_ONLY
> +	FSYNC_METHOD_WRITEOUT_ONLY,
> +	FSYNC_METHOD_BATCH
>  };
>  
>  extern enum fsync_method fsync_method;
> @@ -1767,6 +1768,11 @@ void fsync_or_die(int fd, const char *);
>  int fsync_component(enum fsync_component component, int fd);
>  void fsync_component_or_die(enum fsync_component component, int fd, const char *msg);
>  
> +static inline int batch_fsync_enabled(enum fsync_component component)
> +{
> +	return (fsync_components & component) && (fsync_method == FSYNC_METHOD_BATCH);
> +}
> +
>  ssize_t read_in_full(int fd, void *buf, size_t count);
>  ssize_t write_in_full(int fd, const void *buf, size_t count);
>  ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
> diff --git a/config.c b/config.c
> index 261ee7436e0..0b28f90de8b 100644
> --- a/config.c
> +++ b/config.c
> @@ -1688,6 +1688,8 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
>  			fsync_method = FSYNC_METHOD_FSYNC;
>  		else if (!strcmp(value, "writeout-only"))
>  			fsync_method = FSYNC_METHOD_WRITEOUT_ONLY;
> +		else if (!strcmp(value, "batch"))
> +			fsync_method = FSYNC_METHOD_BATCH;
>  		else
>  			warning(_("ignoring unknown core.fsyncMethod value '%s'"), value);
>  
> diff --git a/object-file.c b/object-file.c
> index 5258d9ed827..bdb0a38328f 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1895,6 +1895,8 @@ static void close_loose_object(int fd)
>  
>  	if (fsync_object_files > 0)
>  		fsync_or_die(fd, "loose object file");
> +	else if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
> +		fsync_loose_object_bulk_checkin(fd);
>  	else
>  		fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd,
>  				       "loose object file");

This is related to the above comments about what minimum set of syscalls
are needed to trigger this "bulk" behavior, but it seems to me that this
whole API is avoiding just passing some new flags down to object-file.c
and friends.

For update-index, for example, this results in the "plug bulk" not
being aware of HASH_WRITE_OBJECT, so with dry-run writes and the like
we'll do the whole setup/teardown for nothing.

Which is another reason I wondered why this couldn't be a flag passed
down to the object writing...
Junio C Hamano March 21, 2022, 5:30 p.m. UTC | #3
"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> +  updates in the disk writeback cache and then does a single full fsync of
> +  a dummy file to trigger the disk cache flush at the end of the operation.

It is unfortunate that we have a rather independent "unplug" that is
not tied to "this is the last operation in the batch"---if it were, we
wouldn't have to invent a dummy file; a single full sync on the real
file that happened to be the last one in the batch would be
sufficient.  Hopefully it does not matter much, provided the batch is
of any meaningful size.

> +/*
> + * Cleanup after batch-mode fsync_object_files.
> + */
> +static void do_batch_fsync(void)
> +{
> +	/*
> +	 * Issue a full hardware flush against a temporary file to ensure
> +	 * that all objects are durable before any renames occur.  The code in
> +	 * fsync_loose_object_bulk_checkin has already issued a writeout
> +	 * request, but it has not flushed any writeback cache in the storage
> +	 * hardware.
> +	 */
> +
> +	if (needs_batch_fsync) {
> +		struct strbuf temp_path = STRBUF_INIT;
> +		struct tempfile *temp;
> +
> +		strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
> +		temp = xmks_tempfile(temp_path.buf);
> +		fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
> +		delete_tempfile(&temp);
> +		strbuf_release(&temp_path);
> +		needs_batch_fsync = 0;
> +	}
> +
> +	if (bulk_fsync_objdir) {
> +		tmp_objdir_migrate(bulk_fsync_objdir);
> +		bulk_fsync_objdir = NULL;

The struct obtained from tmp_objdir_create() is consumed by
tmp_objdir_migrate() so the only clean-up left for the caller to do
is to clear it to NULL.  OK.

> +	}

This initially made me wonder why we need two independent flags.
After applying this patch but not any later steps, upon plugging, we
create the tentative object directory, and any loose object will be
created there, but because nobody calls the writeout-only variant
via fsync_loose_object_bulk_checkin() yet, needs_batch_fsync may not
be turned on.  But even in that case, any new loose objects are in
the tentative object directory and need to be migrated to the real
place.

And we may not cover all the existing code paths at the end of the
series, or any new code paths right away after they get introduced,
to be aware of the fsync_loose_object_bulk_checkin() when they
create a loose object file, so it is most likely that these two if
statements will be with us forever.

OK.

> @@ -274,6 +311,24 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
>  	return 0;
>  }
>  
> +void fsync_loose_object_bulk_checkin(int fd)
> +{
> +	/*
> +	 * If we have a plugged bulk checkin, we issue a call that
> +	 * cleans the filesystem page cache but avoids a hardware flush
> +	 * command. Later on we will issue a single hardware flush
> +	 * before as part of do_batch_fsync.
> +	 */
> +	if (bulk_checkin_plugged &&
> +	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
> +		assert(bulk_fsync_objdir);
> +		if (!needs_batch_fsync)
> +			needs_batch_fsync = 1;

Except for when we unplug, do we ever flip the needs_batch_fsync bit
off, once it is set?  If the answer is no, wouldn't it be clearer to
unconditionally set it, instead of "set it only for the first time"?

> +	} else {
> +		fsync_or_die(fd, "loose object file");
> +	}
> +}
> +
>  int index_bulk_checkin(struct object_id *oid,
>  		       int fd, size_t size, enum object_type type,
>  		       const char *path, unsigned flags)
> @@ -288,6 +343,19 @@ int index_bulk_checkin(struct object_id *oid,
>  void plug_bulk_checkin(void)
>  {
>  	assert(!bulk_checkin_plugged);
> +
> +	/*
> +	 * A temporary object directory is used to hold the files
> +	 * while they are not fsynced.
> +	 */
> +	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> +		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> +		if (!bulk_fsync_objdir)
> +			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
> +
> +		tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
> +	}
> +
>  	bulk_checkin_plugged = 1;
>  }
>  
> @@ -297,4 +365,6 @@ void unplug_bulk_checkin(void)
>  	bulk_checkin_plugged = 0;
>  	if (bulk_checkin_state.f)
>  		finish_bulk_checkin(&bulk_checkin_state);
> +
> +	do_batch_fsync();
>  }
> diff --git a/bulk-checkin.h b/bulk-checkin.h
> index b26f3dc3b74..08f292379b6 100644
> --- a/bulk-checkin.h
> +++ b/bulk-checkin.h
> @@ -6,6 +6,8 @@
>  
>  #include "cache.h"
>  
> +void fsync_loose_object_bulk_checkin(int fd);
> +
>  int index_bulk_checkin(struct object_id *oid,
>  		       int fd, size_t size, enum object_type type,
>  		       const char *path, unsigned flags);
> diff --git a/cache.h b/cache.h
> index 3160bc1e489..d1ae51388c9 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1040,7 +1040,8 @@ extern int use_fsync;
>  
>  enum fsync_method {
>  	FSYNC_METHOD_FSYNC,
> -	FSYNC_METHOD_WRITEOUT_ONLY
> +	FSYNC_METHOD_WRITEOUT_ONLY,
> +	FSYNC_METHOD_BATCH
>  };

Style.

These days we allow a trailing comma in enum definitions.  Perhaps
add a trailing comma after _BATCH so that the next patch to update
this enum becomes less noisy?

Thanks.
Neeraj Singh March 21, 2022, 6:28 p.m. UTC | #4
On Mon, Mar 21, 2022 at 7:43 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> > [...]
> > +     if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> > +             bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> > +             if (!bulk_fsync_objdir)
> > +                     die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
>
> Should camel-case the config var, and we should have a die_errno() here
> which tells us why we couldn't create it (possibly needing to ferry it up
> from the tempfile API...)

Thanks for noticing the camelCasing.  The config var name was also
wrong. Now it will read:
> > +                     die(_("Could not create temporary object directory for core.fsyncMethod=batch"));

Do you have any recommendations on how to easily ferry the correct
errno out of tmp_objdir_create?
It looks like the remerge-diff usage has the same die behavior w/o
errno, and the builtin/receive-pack.c usage
doesn't die, but also loses the errno.  I'm concerned about preserving
the errno across the tmp_objdir_destroy calls.  I could introduce a
temp errno var to preserve it across those.  Is that what you had in
mind?
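For illustration, the temp-errno-var idea would look something like this (cleanup_that_clobbers_errno() is a stand-in for tmp_objdir_destroy(), which may overwrite errno during its own syscalls):

```c
#include <errno.h>

/* Stand-in for tmp_objdir_destroy(): cleanup paths often make syscalls
 * that overwrite errno from the original failure. */
static void cleanup_that_clobbers_errno(void)
{
	errno = 0;
}

/* Save errno from the failing operation, run cleanup, then restore it
 * so the caller can report the real cause (e.g. via die_errno()). */
static int create_with_cleanup(void)
{
	int saved_errno;

	errno = ENOSPC;		/* pretend tmp_objdir_create() failed */
	saved_errno = errno;
	cleanup_that_clobbers_errno();
	errno = saved_errno;	/* restored for error reporting */
	return -1;
}
```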

Thanks,
Neeraj
Neeraj Singh March 21, 2022, 8:14 p.m. UTC | #5
On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > One major source of the cost of fsync is the implied flush of the
> > hardware writeback cache within the disk drive. This commit introduces
> > a new `core.fsyncMethod=batch` option that batches up hardware flushes.
> > It hooks into the bulk-checkin plugging and unplugging functionality,
> > takes advantage of tmp-objdir, and uses the writeout-only support code.
> >
> > When the new mode is enabled, we do the following for each new object:
> > 1. Create the object in a tmp-objdir.
> > 2. Issue a pagecache writeback request and wait for it to complete.
> >
> > At the end of the entire transaction when unplugging bulk checkin:
> > 1. Issue an fsync against a dummy file to flush the hardware writeback
> >    cache, which should by now have seen the tmp-objdir writes.
> > 2. Rename all of the tmp-objdir files to their final names.
> > 3. When updating the index and/or refs, we assume that Git will issue
> >    another fsync internal to that operation. This is not the default
> >    today, but the user now has the option of syncing the index and there
> >    is a separate patch series to implement syncing of refs.
>
> Re my question in
> https://lore.kernel.org/git/220310.86r179ki38.gmgdl@evledraar.gmail.com/
> (which you *partially* replied to per my reading, i.e. not the
> fsync_nth() question) I still don't get why the tmp-objdir part of this
> is needed.
>

Sorry for not fully answering your question. I think part of the issue
might be background: it's not clear to me what's different between your
understanding and mine, so I may not have included something that's
questionable to you but not to me.

Your syscall description below makes the issues very concrete, so I
think we'll get it this round :).

> For "git stash" which is one thing sped up by this let's go over what
> commands/FS ops we do. I changed the test like this:
>
>         diff --git a/t/t3903-stash.sh b/t/t3903-stash.sh
>         index 3fc16944e9e..479a495c68c 100755
>         --- a/t/t3903-stash.sh
>         +++ b/t/t3903-stash.sh
>         @@ -1383,7 +1383,7 @@ BATCH_CONFIGURATION='-c core.fsync=loose-object -c core.fsyncmethod=batch'
>
>          test_expect_success 'stash with core.fsyncmethod=batch' "
>                 test_create_unique_files 2 4 fsync-files &&
>         -       git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ &&
>         +       strace -f git $BATCH_CONFIGURATION stash push -u -- ./fsync-files/ &&
>                 rm -f fsynced_files &&
>
>                 # The files were untracked, so use the third parent,
>
> Then we get this output, with my comments, and I snipped some output:
>
>         $ ./t3903-stash.sh --run=1-4,114 -vixd 2>&1|grep --color -e 89772c935031c228ed67890f9 -e .git/stash -e bulk_fsync -e .git/index
>         [pid 14703] access(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
>         [pid 14703] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
>         [pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/tmp_obj_bdUlzu", ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0
>
> Here we're creating the tmp_objdir() files. We then sync_file_range()
> and close() this.
>
>         [pid 14703] openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7", O_RDWR|O_CREAT|O_EXCL, 0600) = 9
>         [pid 14703] unlink("/home/avar/g/git/t/trash directory.t3903-stash/.git/objects/tmp_objdir-bulk-fsync-rR3AQI/bulk_fsync_HsDRl7") = 0
>
> This is the flushing of the "cookie" in do_batch_fsync().
>
>         [pid 14703] newfstatat(AT_FDCWD, ".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, 0) = 0
>         [pid 14703] link(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316", ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0
>
> Here we're going through the object dir migration with
> unplug_bulk_checkin().
>
>         [pid 14703] unlink(".git/objects/tmp_objdir-bulk-fsync-rR3AQI/fb/89772c935031c228ed67890f953c0a2b5c8316") = 0
>         newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0
>         [pid 14705] access(".git/objects/tmp_objdir-bulk-fsync-0F7DGy/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = -1 ENOENT (No such file or directory)
>         [pid 14705] access(".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", F_OK) = 0
>         [pid 14705] utimensat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", NULL, 0) = 0
>         [pid 14707] openat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", O_RDONLY|O_CLOEXEC) = 9
>
> We then update the index itself, first a temporary index.stash :
>
>     openat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 8
>     openat(AT_FDCWD, ".git/index.stash.19141", O_RDONLY) = 9
>     newfstatat(AT_FDCWD, ".git/objects/fb/89772c935031c228ed67890f953c0a2b5c8316", {st_mode=S_IFREG|0444, st_size=29, ...}, AT_SYMLINK_NOFOLLOW) = 0
>     newfstatat(AT_FDCWD, "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", {st_mode=S_IFREG|0644, st_size=927, ...}, 0) = 0
>     rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index.stash.19141") = 0
>     unlink(".git/index.stash.19141")        = 0
>
> Followed by the same and a later rename of the actual index:
>
>     [pid 19146] rename("/home/avar/g/git/t/trash directory.t3903-stash/.git/index.lock", "/home/avar/g/git/t/trash directory.t3903-stash/.git/index") = 0
>
> So, my question is still why the temporary object dir migration part of
> this is needed.
>
> We are writing N loose object files, and we write those to temporary
> names already.
>
> AFAICT we could do all of this by doing the same
> tmp/rename/sync_file_range dance on the main object store.
>

Why not the main object store? We want to maintain the invariant that any
name in the main object store refers to a file that durably has the
correct contents.
If we do sync_file_range and then rename, and then crash, we now have a file
in the main object store with some SHA name, whose contents may or may not
match the SHA.  However, if we ensure an fsync happens before the rename,
a crash at any point will leave us either with no file in the main
object store or
with a file that is durable on the disk.

> Then instead of the "bulk_fsync" cookie file don't close() the last file
> object file we write until we issue the fsync on it.
>
> But maybe this is all needed, I just can't understand from the commit
> message why the "bulk checkin" part is being done.
>
> I think since we've been over this a few times without any success it
> would really help to have some example of the smallest set of syscalls
> to write a file like this safely. I.e. this is doing (pseudocode):
>
>     /* first the bulk path */
>     open("bulk/x.tmp");
>     write("bulk/x.tmp");
>     sync_file_range("bulk/x.tmp");
>     close("bulk/x.tmp");
>     rename("bulk/x.tmp", "bulk/x");
>     open("bulk/y.tmp");
>     write("bulk/y.tmp");
>     sync_file_range("bulk/y.tmp");
>     close("bulk/y.tmp");
>     rename("bulk/y.tmp", "bulk/y");
>     /* Rename to "real" */
>     rename("bulk/x", "x");
>     rename("bulk/y", "y");
>     /* sync a cookie */
>     fsync("cookie");
>

The '/* Rename to "real" */' and '/* sync a cookie */' steps are
reversed in your above sequence. It should be
1: (for each file)
    a) open
    b) write
    c) sync_file_range
    d) close
    e) rename in tmp_objdir  -- The rename step is not required for
bulk-fsync. An earlier version of this series didn't do it, but
Jeff King pointed out that it was required for concurrency:
https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@coredump.intra.peff.net/

2: fsync something on the same volume to flush the filesystem log and
disk cache. This functions as a "barrier".
3: Rename to final names.  At this point we know that the "contents"
are durable, so if the final name exists, we can read through it to
get the data.
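Putting 1-3 together in the same pseudocode style as your sequence above, the safe ordering is:

    /* per object, all inside the tmp objdir */
    open("bulk/x.tmp");
    write("bulk/x.tmp");
    sync_file_range("bulk/x.tmp");
    close("bulk/x.tmp");
    rename("bulk/x.tmp", "bulk/x");
    open("bulk/y.tmp");
    write("bulk/y.tmp");
    sync_file_range("bulk/y.tmp");
    close("bulk/y.tmp");
    rename("bulk/y.tmp", "bulk/y");
    /* barrier: one full flush covers the whole batch */
    fsync("cookie");
    /* only now do the final names appear in the real objdir */
    rename("bulk/x", "x");
    rename("bulk/y", "y");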

> And I'm asking why it's not:
>
>     /* Rename to "real" as we go */
>     open("x.tmp");
>     write("x.tmp");
>     sync_file_range("x.tmp");
>     close("x.tmp");
>     rename("x.tmp", "x");
>     last_fd = open("y.tmp"); /* don't close() the last one yet */
>     write("y.tmp");
>     sync_file_range("y.tmp");
>     rename("y.tmp", "y");
>     /* sync a cookie */
>     fsync(last_fd);
>
> Which I guess is two questions:
>
>  A. do we need the cookie, or can we re-use the fd of the last thing we
>     write?

We can re-use the FD of the last thing we write, but that results in a
trickier API which is more intrusive on callers. I was originally
using a lockfile, but found a usage in unpack-objects where there was
no lockfile.

>  B. Is the bulk indirection needed?
>

Hopefully the explanation above makes it clear why we need the
indirection. To state it again,
we need a real fsync before creating the final name in the objdir,
otherwise on a crash a name
could exist that points at contents which could have been lost, since
they weren't durable. I
updated the comment in do_batch_fsync to make this a little clearer.

> > +             fsync_or_die(fd, "loose object file");
>
> Unrelated nit: this API is producing sentence lego unfriendly to
> translators.
>
> Should be made to take an enum or something, so we can emit the relevant
> translated message in fsync_or_die(). Imagine getting:
>
>         fsync error on '日本語は話せません'
>
> Which this will do, just the other way around for non-English speakers
> using the translation.
>
> (The solution is also not to add _() here, since translators will want
> to control the word order.)

This line is copied from the preexisting version of the same code in
close_loose_object.
If I'm understanding it correctly, the entire chain of messages is
untranslated and would remain as English.  fsync_or_die doesn't have a
_().  Can we just
leave it that way, since
this is not a situation that should actually happen to many users?
Alternatively, I think it
would be pretty trivial to just pass through the file name, so I'll
just do that.

> > diff --git a/cache.h b/cache.h
> > index 3160bc1e489..d1ae51388c9 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -1040,7 +1040,8 @@ extern int use_fsync;
> >
> >  enum fsync_method {
> >       FSYNC_METHOD_FSYNC,
> > -     FSYNC_METHOD_WRITEOUT_ONLY
> > +     FSYNC_METHOD_WRITEOUT_ONLY,
> > +     FSYNC_METHOD_BATCH
> >  };
> >
> >  extern enum fsync_method fsync_method;
> > @@ -1767,6 +1768,11 @@ void fsync_or_die(int fd, const char *);
> >  int fsync_component(enum fsync_component component, int fd);
> >  void fsync_component_or_die(enum fsync_component component, int fd, const char *msg);
> >
> > +static inline int batch_fsync_enabled(enum fsync_component component)
> > +{
> > +     return (fsync_components & component) && (fsync_method == FSYNC_METHOD_BATCH);
> > +}
> > +
> >  ssize_t read_in_full(int fd, void *buf, size_t count);
> >  ssize_t write_in_full(int fd, const void *buf, size_t count);
> >  ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
> > diff --git a/config.c b/config.c
> > index 261ee7436e0..0b28f90de8b 100644
> > --- a/config.c
> > +++ b/config.c
> > @@ -1688,6 +1688,8 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
> >                       fsync_method = FSYNC_METHOD_FSYNC;
> >               else if (!strcmp(value, "writeout-only"))
> >                       fsync_method = FSYNC_METHOD_WRITEOUT_ONLY;
> > +             else if (!strcmp(value, "batch"))
> > +                     fsync_method = FSYNC_METHOD_BATCH;
> >               else
> >                       warning(_("ignoring unknown core.fsyncMethod value '%s'"), value);
> >
> > diff --git a/object-file.c b/object-file.c
> > index 5258d9ed827..bdb0a38328f 100644
> > --- a/object-file.c
> > +++ b/object-file.c
> > @@ -1895,6 +1895,8 @@ static void close_loose_object(int fd)
> >
> >       if (fsync_object_files > 0)
> >               fsync_or_die(fd, "loose object file");
> > +     else if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
> > +             fsync_loose_object_bulk_checkin(fd);
> >       else
> >               fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd,
> >                                      "loose object file");
>
> This is related to the above comments about what minimum set of syscalls
> are needed to trigger this "bulk" behavior, but it seems to me that this
> whole API is avoiding just passing some new flags down to object-file.c
> and friends.
>
> For e.g. update-index that results in e.g. the "plug bulk" not being
> aware of HASH_WRITE_OBJECT, so with dry-run writes and the like we'll do
> the whole setup/teardown for nothing.
>
> Which is another reason I wondered why this couldn't be a flagged passed
> down to the object writing...

In the original implementation [1], I did some custom thing for renaming
the files rather than tmp_objdir. But you suggested at the time that I
use tmp_objdir, which was a good decision, since it made access to the
objects possible in-process and for descendants in the middle of the
transaction.

It sounds to me like I just shouldn't plug the bulk checkin for cases
where we're not going to add to the ODB; plugging the bulk checkin is
always optional. But when I wrote the code that way, I didn't love the
result, since it makes things harder for arbitrary callers. So I changed
the code to lazily create the tmp objdir the first time an object shows
up, which has the same effect of avoiding the cost when we aren't adding
any objects. This also avoids the need to write an error message, since
failing to create the tmp objdir will just result in an fsync. The main
downside here is that it's another thing that will have to change if we
want to make adding to the ODB multithreaded.

Thanks,
Neeraj

[1] https://lore.kernel.org/all/12cad737635663ed596e52f89f0f4f22f58bfe38.1632176111.git.gitgitgadget@gmail.com/
Ævar Arnfjörð Bjarmason March 21, 2022, 8:18 p.m. UTC | #6
On Mon, Mar 21 2022, Neeraj Singh wrote:

[Don't have time for a full reply, sorry, just something quick]

> On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
> [...]
>> So, my question is still why the temporary object dir migration part of
>> this is needed.
>>
>> We are writing N loose object files, and we write those to temporary
>> names already.
>>
>> AFAIKT we could do all of this by doing the same
>> tmp/rename/sync_file_range dance on the main object store.
>>
>
> Why not the main object store? We want to maintain the invariant that any
> name in the main object store refers to a file that durably has the
> correct contents.
> If we do sync_file_range and then rename, and then crash, we now have a file
> in the main object store with some SHA name, whose contents may or may not
> match the SHA.  However, if we ensure an fsync happens before the rename,
> a crash at any point will leave us either with no file in the main
> object store or
> with a file that is durable on the disk.

Ah, I see.

Why does that matter? If the "bulk" mode works as advertised we might
have such a corrupt loose or pack file, but we won't have anything
referring to it as far as reachability goes.

I'm aware that the various code paths that handle OID writing don't deal
too well with it in practice to say the least, which one can try with
say:

    $ echo foo | git hash-object -w --stdin
    45b983be36b73c0788dc9cbcb76cbb80fc7bb057
    $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057

I.e. "fsck", "show" etc. will all scream bloody murder, and re-running
that hash-object again even returns successful (we see it's there
already, and think it's OK).

But in any case, I think it would be much easier to both review and
reason about this code if these concerns were split up.

I.e. things that want no fsync at all (I'd think especially so) might
want to have such updates serialized in this manner, and as Junio
pointed out making these things inseparable as you've done creates API
concerns & fallout that's got nothing to do with what we need for the
performance gains of the bulk checkin fsyncing technique,
e.g. concurrent "update-index" consumers not being able to assume
reported objects exist as soon as they're reported.

>> Then instead of the "bulk_fsync" cookie file don't close() the last file
>> object file we write until we issue the fsync on it.
>>
>> But maybe this is all needed, I just can't understand from the commit
>> message why the "bulk checkin" part is being done.
>>
>> I think since we've been over this a few times without any success it
>> would really help to have some example of the smallest set of syscalls
>> to write a file like this safely. I.e. this is doing (pseudocode):
>>
>>     /* first the bulk path */
>>     open("bulk/x.tmp");
>>     write("bulk/x.tmp");
>>     sync_file_range("bulk/x.tmp");
>>     close("bulk/x.tmp");
>>     rename("bulk/x.tmp", "bulk/x");
>>     open("bulk/y.tmp");
>>     write("bulk/y.tmp");
>>     sync_file_range("bulk/y.tmp");
>>     close("bulk/y.tmp");
>>     rename("bulk/y.tmp", "bulk/y");
>>     /* Rename to "real" */
>>     rename("bulk/x", "x");
>>     rename("bulk/y", "y");
>>     /* sync a cookie */
>>     fsync("cookie");
>>
>
> The '/* Rename to "real" */' and '/* sync a cookie */' steps are
> reversed in your above sequence. It should be

Sorry.

> 1: (for each file)
>     a) open
>     b) write
>     c) sync_file_range
>     d) close
>     e) rename in tmp_objdir  -- The rename step is not required for
> bulk-fsync. An earlier version of this series didn't do it, but
> Jeff King pointed out that it was required for concurrency:
> https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@coredump.intra.peff.net/

Yes, we definitely need the rename; I was wondering why we needed it
twice for each file, but that was answered above.

>> And I'm asking why it's not:
>>
>>     /* Rename to "real" as we go */
>>     open("x.tmp");
>>     write("x.tmp");
>>     sync_file_range("x.tmp");
>>     close("x.tmp");
>>     rename("x.tmp", "x");
>>     last_fd = open("y.tmp"); /* don't close() the last one yet */
>>     write("y.tmp");
>>     sync_file_range("y.tmp");
>>     rename("y.tmp", "y");
>>     /* sync a cookie */
>>     fsync(last_fd);
>>
>> Which I guess is two questions:
>>
>>  A. do we need the cookie, or can we re-use the fd of the last thing we
>>     write?
>
> We can re-use the FD of the last thing we write, but that results in a
> trickier API which is more intrusive on callers. I was originally using
> a lockfile, but found a usage in unpack-objects where there was no
> lockfile.

Ok, so it's something we could do, but passing down 2-3 functions to
object-file.c was a hassle.

I tried to hack that up earlier and found that it wasn't *too
bad*. I.e. we'd pass some "flags" about our intent, and amend various
functions to take "don't close this one" and pass up the fd (or even do
that as a global).

In any case, having the commit message clearly document what's needed
for what & what's essential & just shortcut taken for the convenience of
the current implementation would be really useful.

Then we can always e.g. change this later to just do the fsync() on
the last of N we write.

[Ran out of time, sorry]
Neeraj Singh March 21, 2022, 8:23 p.m. UTC | #7
On Mon, Mar 21, 2022 at 10:30 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> > +  updates in the disk writeback cache and then does a single full fsync of
> > +  a dummy file to trigger the disk cache flush at the end of the operation.
>
> It is unfortunate that we have a rather independent "unplug" that is
> not tied to the "this is the last operation in the batch"---if there
> were we didn't have to invent a dummy but a single full sync on the
> real file who happened to be the last one in the batch would be
> sufficient.  It would not matter, if the batch is any meaningful
> size, hopefully.
>

I'm banking on a large batch size, or on the additional cost of creating
and syncing an empty file being so small that it wouldn't be noticeable
even for small batches. The current unfortunate scheme at least has a
very simple API that's easy to apply to any other operation going
forward. For instance builtin/hash-object.c might be another good
operation, but it wasn't clear to me if it's used for any mainline
scenario.

> > +/*
> > + * Cleanup after batch-mode fsync_object_files.
> > + */
> > +static void do_batch_fsync(void)
> > +{
> > +     /*
> > +      * Issue a full hardware flush against a temporary file to ensure
> > +      * that all objects are durable before any renames occur.  The code in
> > +      * fsync_loose_object_bulk_checkin has already issued a writeout
> > +      * request, but it has not flushed any writeback cache in the storage
> > +      * hardware.
> > +      */
> > +
> > +     if (needs_batch_fsync) {
> > +             struct strbuf temp_path = STRBUF_INIT;
> > +             struct tempfile *temp;
> > +
> > +             strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
> > +             temp = xmks_tempfile(temp_path.buf);
> > +             fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
> > +             delete_tempfile(&temp);
> > +             strbuf_release(&temp_path);
> > +             needs_batch_fsync = 0;
> > +     }
> > +
> > +     if (bulk_fsync_objdir) {
> > +             tmp_objdir_migrate(bulk_fsync_objdir);
> > +             bulk_fsync_objdir = NULL;
>
> The struct obtained from tmp_objdir_create() is consumed by
> tmp_objdir_migrate() so the only clean-up left for the caller to do
> is to clear it to NULL.  OK.
>
> > +     }
>
> This initially made me wonder why we need two independent flags.
> After applying this patch but not any later steps, upon plugging, we
> create the tentative object directory, and any loose object will be
> created there, but because nobody calls the writeout-only variant
> via fsync_loose_object_bulk_checkin() yet, needs_batch_fsync may not
> be turned on.  But even in that case, any new loose objects are in
> the tentative object directory and need to be migrated to the real
> place.
>
> And we may not cover all the existing code paths at the end of the
> series, or any new code paths right away after they get introduced,
> to be aware of the fsync_loose_object_bulk_checkin() when they
> create a loose object file, so it is most likely that these two if
> statements will be with us forever.
>
> OK.

After Ævar's last feedback, I've changed this to lazily create the
objdir, so the existence of an objdir is a suitable proxy for there
being something worth syncing. The potential downside is that the lazy
creation would need to be synchronized if the ODB becomes multithreaded.

>
> > @@ -274,6 +311,24 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
> >       return 0;
> >  }
> >
> > +void fsync_loose_object_bulk_checkin(int fd)
> > +{
> > +     /*
> > +      * If we have a plugged bulk checkin, we issue a call that
> > +      * cleans the filesystem page cache but avoids a hardware flush
> > +      * command. Later on we will issue a single hardware flush
> > +      * before as part of do_batch_fsync.
> > +      */
> > +     if (bulk_checkin_plugged &&
> > +         git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
> > +             assert(bulk_fsync_objdir);
> > +             if (!needs_batch_fsync)
> > +                     needs_batch_fsync = 1;
>
> Except for when we unplug, do we ever flip needs_batch_fsync bit
> off, once it is set?  If the answer is no, wouldn't it be clearer to
> unconditionally set it, instead of "set it only for the first time"?
>

This code is now gone. I was stupidly optimizing for a future
multithreaded world which might never come.

> > +     } else {
> > +             fsync_or_die(fd, "loose object file");
> > +     }
> > +}
> > +
> >  int index_bulk_checkin(struct object_id *oid,
> >                      int fd, size_t size, enum object_type type,
> >                      const char *path, unsigned flags)
> > @@ -288,6 +343,19 @@ int index_bulk_checkin(struct object_id *oid,
> >  void plug_bulk_checkin(void)
> >  {
> >       assert(!bulk_checkin_plugged);
> > +
> > +     /*
> > +      * A temporary object directory is used to hold the files
> > +      * while they are not fsynced.
> > +      */
> > +     if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
> > +             bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
> > +             if (!bulk_fsync_objdir)
> > +                     die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
> > +
> > +             tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
> > +     }
> > +
> >       bulk_checkin_plugged = 1;
> >  }
> >
> > @@ -297,4 +365,6 @@ void unplug_bulk_checkin(void)
> >       bulk_checkin_plugged = 0;
> >       if (bulk_checkin_state.f)
> >               finish_bulk_checkin(&bulk_checkin_state);
> > +
> > +     do_batch_fsync();
> >  }
> > diff --git a/bulk-checkin.h b/bulk-checkin.h
> > index b26f3dc3b74..08f292379b6 100644
> > --- a/bulk-checkin.h
> > +++ b/bulk-checkin.h
> > @@ -6,6 +6,8 @@
> >
> >  #include "cache.h"
> >
> > +void fsync_loose_object_bulk_checkin(int fd);
> > +
> >  int index_bulk_checkin(struct object_id *oid,
> >                      int fd, size_t size, enum object_type type,
> >                      const char *path, unsigned flags);
> > diff --git a/cache.h b/cache.h
> > index 3160bc1e489..d1ae51388c9 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -1040,7 +1040,8 @@ extern int use_fsync;
> >
> >  enum fsync_method {
> >       FSYNC_METHOD_FSYNC,
> > -     FSYNC_METHOD_WRITEOUT_ONLY
> > +     FSYNC_METHOD_WRITEOUT_ONLY,
> > +     FSYNC_METHOD_BATCH
> >  };
>
> Style.
>
> These days we allow trailing comma to enum definitions.  Perhaps
> give a trailing comma after _BATCH so that the next update patch
> will become less noisy?
>

Fixed.

> Thanks.

Thanks!
-Neeraj
Neeraj Singh March 22, 2022, 12:13 a.m. UTC | #8
On Mon, Mar 21, 2022 at 1:37 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Mon, Mar 21 2022, Neeraj Singh wrote:
>
> [Don't have time for a full reply, sorry, just something quick]
>
> > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
> > [...]
> >> So, my question is still why the temporary object dir migration part of
> >> this is needed.
> >>
> >> We are writing N loose object files, and we write those to temporary
> >> names already.
> >>
> >> AFAIKT we could do all of this by doing the same
> >> tmp/rename/sync_file_range dance on the main object store.
> >>
> >
> > Why not the main object store? We want to maintain the invariant that any
> > name in the main object store refers to a file that durably has the
> > correct contents.
> > If we do sync_file_range and then rename, and then crash, we now have a file
> > in the main object store with some SHA name, whose contents may or may not
> > match the SHA.  However, if we ensure an fsync happens before the rename,
> > a crash at any point will leave us either with no file in the main
> > object store or
> > with a file that is durable on the disk.
>
> Ah, I see.
>
> Why does that matter? If the "bulk" mode works as advertised we might
> have such a corrupt loose or pack file, but we won't have anything
> referring to it as far as reachability goes.
>
> I'm aware that the various code paths that handle OID writing don't deal
> too well with it in practice to say the least, which one can try with
> say:
>
>     $ echo foo | git hash-object -w --stdin
>     45b983be36b73c0788dc9cbcb76cbb80fc7bb057
>     $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057
>
> I.e. "fsck", "show" etc. will all scream bloody murder, and re-running
> that hash-object again even returns successful (we see it's there
> already, and think it's OK).
>

I was under the impression that in-practice a corrupt loose-object can
create persistent problems in the repo for future commands, since we
might not aggressively verify that an existing file with a certain OID
really is valid when adding a new instance of the data with the same
OID.

If you don't have an fsync barrier before producing the final
content-addressable name, you can't reason about "this operation
happened before that operation," so it wouldn't really be valid to say
that "we won't have anything referring to it as far as reachability
goes."

It's entirely possible that you'd have trees pointing to other trees or
blobs that aren't valid, since data writes can be durable in any order.
At this point, future attempts to add the same blobs or trees might
silently drop the updates. I'm betting that's why core.fsyncObjectFiles
was added in the first place, since someone observed severe persistent
consequences for this form of corruption.

> But in any case, I think it would be much easier to both review and
> reason about this code if these concerns were split up.
>
> I.e. things that want no fsync at all (I'd think especially so) might
> want to have such updates serialized in this manner, and as Junio
> pointed out making these things inseparable as you've done creates API
> concerns & fallout that's got nothing to do with what we need for the
> performance gains of the bulk checkin fsyncing technique,
> e.g. concurrent "update-index" consumers not being able to assume
> reported objects exist as soon as they're reported.
>

I want to explicitly not respond to this concern. I don't believe this
100-line patch can be usefully split.

> >> Then instead of the "bulk_fsync" cookie file don't close() the last file
> >> object file we write until we issue the fsync on it.
> >>
> >> But maybe this is all needed, I just can't understand from the commit
> >> message why the "bulk checkin" part is being done.
> >>
> >> I think since we've been over this a few times without any success it
> >> would really help to have some example of the smallest set of syscalls
> >> to write a file like this safely. I.e. this is doing (pseudocode):
> >>
> >>     /* first the bulk path */
> >>     open("bulk/x.tmp");
> >>     write("bulk/x.tmp");
> >>     sync_file_range("bulk/x.tmp");
> >>     close("bulk/x.tmp");
> >>     rename("bulk/x.tmp", "bulk/x");
> >>     open("bulk/y.tmp");
> >>     write("bulk/y.tmp");
> >>     sync_file_range("bulk/y.tmp");
> >>     close("bulk/y.tmp");
> >>     rename("bulk/y.tmp", "bulk/y");
> >>     /* Rename to "real" */
> >>     rename("bulk/x", "x");
> >>     rename("bulk/y", "y");
> >>     /* sync a cookie */
> >>     fsync("cookie");
> >>
> >
> > The '/* Rename to "real" */' and '/* sync a cookie */' steps are
> > reversed in your above sequence. It should be
>
> Sorry.
>
> > 1: (for each file)
> >     a) open
> >     b) write
> >     c) sync_file_range
> >     d) close
> >     e) rename in tmp_objdir  -- The rename step is not required for
> > bulk-fsync. An earlier version of this series didn't do it, but
> > Jeff King pointed out that it was required for concurrency:
> > https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@coredump.intra.peff.net/
>
> Yes we definitely need the rename, I was wondering about why we needed
> it 2x for each file, but that was answered above.
>
> >> And I'm asking why it's not:
> >>
> >>     /* Rename to "real" as we go */
> >>     open("x.tmp");
> >>     write("x.tmp");
> >>     sync_file_range("x.tmp");
> >>     close("x.tmp");
> >>     rename("x.tmp", "x");
> >>     last_fd = open("y.tmp"); /* don't close() the last one yet */
> >>     write("y.tmp");
> >>     sync_file_range("y.tmp");
> >>     rename("y.tmp", "y");
> >>     /* sync a cookie */
> >>     fsync(last_fd);
> >>
> >> Which I guess is two questions:
> >>
> >>  A. do we need the cookie, or can we re-use the fd of the last thing we
> >>     write?
> >
> > We can re-use the FD of the last thing we write, but that results in a
> > trickier API which is more intrusive on callers. I was originally using
> > a lockfile, but found a usage in unpack-objects where there was no
> > lockfile.
>
> Ok, so it's something we could do, but passing down 2-3 functions to
> object-file.c was a hassle.
>
> I tried to hack that up earlier and found that it wasn't *too
> bad*. I.e. we'd pass some "flags" about our intent, and amend various
> functions to take "don't close this one" and pass up the fd (or even do
> that as a global).
>
> In any case, having the commit message clearly document what's needed
> for what & what's essential & just shortcut taken for the convenience of
> the current implementation would be really useful.
>
> Then we can always e.g. change this later to just do the fsync() on
> the last of N we write.
>

I left a comment in the (now very long) commit message that indicates
the dummy file is there to make the API simpler.

Thanks,
Neeraj
Ævar Arnfjörð Bjarmason March 22, 2022, 8:52 a.m. UTC | #9
On Mon, Mar 21 2022, Neeraj Singh wrote:

> On Mon, Mar 21, 2022 at 1:37 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Mon, Mar 21 2022, Neeraj Singh wrote:
>>
>> [Don't have time for a full reply, sorry, just something quick]
>>
>> > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
>> > [...]
>> >> So, my question is still why the temporary object dir migration part of
>> >> this is needed.
>> >>
>> >> We are writing N loose object files, and we write those to temporary
>> >> names already.
>> >>
>> >> AFAIKT we could do all of this by doing the same
>> >> tmp/rename/sync_file_range dance on the main object store.
>> >>
>> >
>> > Why not the main object store? We want to maintain the invariant that any
>> > name in the main object store refers to a file that durably has the
>> > correct contents.
>> > If we do sync_file_range and then rename, and then crash, we now have a file
>> > in the main object store with some SHA name, whose contents may or may not
>> > match the SHA.  However, if we ensure an fsync happens before the rename,
>> > a crash at any point will leave us either with no file in the main
>> > object store or
>> > with a file that is durable on the disk.
>>
>> Ah, I see.
>>
>> Why does that matter? If the "bulk" mode works as advertised we might
>> have such a corrupt loose or pack file, but we won't have anything
>> referring to it as far as reachability goes.
>>
>> I'm aware that the various code paths that handle OID writing don't deal
>> too well with it in practice to say the least, which one can try with
>> say:
>>
>>     $ echo foo | git hash-object -w --stdin
>>     45b983be36b73c0788dc9cbcb76cbb80fc7bb057
>>     $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057
>>
>> I.e. "fsck", "show" etc. will all scream bloody murder, and re-running
>> that hash-object again even returns successful (we see it's there
>> already, and think it's OK).
>>
>
> I was under the impression that in-practice a corrupt loose-object can create
> persistent problems in the repo for future commands, since we might not
> aggressively verify that an existing file with a certain OID really is
> valid when
> adding a new instance of the data with the same OID.

Yes, it can. As the hash-object case shows we don't even check at all.

For "incoming push" we *will* notice, but will just uselessly error
out.

I actually had some patches a while ago to turn off our own home-grown
SHA-1 collision checking.

It had the nice side effect of making it easier to recover from loose
object corruption, since you could (re-)push the corrupted OID as a
PACK, we wouldn't check (and die) on the bad loose object, and since we
take a PACK over LOOSE we'd recover:
https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/

> If you don't have an fsync barrier before producing the final
> content-addressable name, you can't reason about "this operation
> happened before that operation," so it wouldn't really be valid to say
> that "we won't have anything referring to it as far as reachability
> goes."

That's correct, but we're discussing a feature that *does have* that
fsync barrier. So if we get an error while writing the loose objects
before the "cookie" fsync we'll presumably error out. That'll then be
followed by an fsync() of whatever makes the objects reachable.

> It's entirely possible that you'd have trees pointing to other trees or
> blobs that aren't valid, since data writes can be durable in any order.
> At this point, future attempts to add the same blobs or trees might
> silently drop the updates. I'm betting that's why core.fsyncObjectFiles
> was added in the first place, since someone observed severe persistent
> consequences for this form of corruption.

Well, you can see Linus's original rant-as-documentation for why we
added it :) I.e. the original git implementation made some heavy
linux-FS assumption about the order of writes and an fsync() flushing
any previous writes, which wasn't portable.

>> But in any case, I think it would be much easier to both review and
>> reason about this code if these concerns were split up.
>>
>> I.e. things that want no fsync at all (I'd think especially so) might
>> want to have such updates serialized in this manner, and as Junio
>> pointed out making these things inseparable as you've done creates API
>> concerns & fallout that's got nothing to do with what we need for the
>> performance gains of the bulk checkin fsyncing technique,
>> e.g. concurrent "update-index" consumers not being able to assume
>> reported objects exist as soon as they're reported.
>>
>
> I want to explicitly not respond to this concern. I don't believe this
> 100 line patch
> can be usefully split.

Leaving "usefully" aside for a second (since that's subjective), it
clearly "can". I just tried this on top of "seen":

	diff --git a/bulk-checkin.c b/bulk-checkin.c
	index a702e0ff203..9e994c4d6ae 100644
	--- a/bulk-checkin.c
	+++ b/bulk-checkin.c
	@@ -9,15 +9,12 @@
	 #include "pack.h"
	 #include "strbuf.h"
	 #include "string-list.h"
	-#include "tmp-objdir.h"
	 #include "packfile.h"
	 #include "object-store.h"
	 
	 static int bulk_checkin_plugged;
	 static int needs_batch_fsync;
	 
	-static struct tmp_objdir *bulk_fsync_objdir;
	-
	 static struct bulk_checkin_state {
	 	char *pack_tmp_name;
	 	struct hashfile *f;
	@@ -110,11 +107,6 @@ static void do_batch_fsync(void)
	 		strbuf_release(&temp_path);
	 		needs_batch_fsync = 0;
	 	}
	-
	-	if (bulk_fsync_objdir) {
	-		tmp_objdir_migrate(bulk_fsync_objdir);
	-		bulk_fsync_objdir = NULL;
	-	}
	 }
	 
	 static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
	@@ -321,7 +313,6 @@ void fsync_loose_object_bulk_checkin(int fd)
	 	 */
	 	if (bulk_checkin_plugged &&
	 	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
	-		assert(bulk_fsync_objdir);
	 		if (!needs_batch_fsync)
	 			needs_batch_fsync = 1;
	 	} else {
	@@ -343,19 +334,6 @@ int index_bulk_checkin(struct object_id *oid,
	 void plug_bulk_checkin(void)
	 {
	 	assert(!bulk_checkin_plugged);
	-
	-	/*
	-	 * A temporary object directory is used to hold the files
	-	 * while they are not fsynced.
	-	 */
	-	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
	-		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
	-		if (!bulk_fsync_objdir)
	-			die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
	-
	-		tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
	-	}
	-
	 	bulk_checkin_plugged = 1;
	 }

And then tried running:

    $ GIT_PERF_MAKE_OPTS='CFLAGS=-O3' ./run HEAD~ HEAD -- p3900-stash.sh	

And got:
    
    Test                                              HEAD~              HEAD
    --------------------------------------------------------------------------------------------
    3900.2: stash 500 files (object_fsyncing=false)   0.56(0.08+0.09)    0.60(0.08+0.08) +7.1%
    3900.4: stash 500 files (object_fsyncing=true)    14.50(0.07+0.15)   17.13(0.10+0.12) +18.1%
    3900.6: stash 500 files (object_fsyncing=batch)   1.14(0.08+0.11)    1.03(0.08+0.10) -9.6%

Now, I really don't trust that perf run to say anything except these
being in the same ballpark, but it's clearly going to be a *bit* faster
since we'll be doing fewer IOps.

As to "usefully" I really do get what you're saying that you only find
these useful when you combine the two because you'd like to have 100%
safety, and that's fair enough.

But since we are going to have a knob to turn off fsyncing entirely, and
we have this "bulk" mode which requires you to carefully reason about
your FS semantics to ascertain safety the performance/safety trade-off
is clearly something that's useful to have tweaks for.

And with "bulk" the concern about leaving behind stray corrupt objects
is entirely orthogonal to concerns about losing a ref update, which is
the main thing we're worried about.

I also don't see, even if you're arguing that nobody would want one
without the other because everyone who cares about "bulk" cares about
this stray-corrupt-loose-but-no-ref-update case, how it has any business
being tied up in the "bulk" mode as far as the implementation goes.

That's because the same edge case is exposed by
core.fsyncObjectFiles=false for those who are assuming the initial
"ordered" semantics.

I.e. if we're saying that before we write the ref we'd like to not
expose the WIP objects in the primary object store because they're not
fsync'd yet, how is that mode different than "bulk" if we crash while
doing that operation (before the eventual fsync()).

So I really think it's much better to split these concerns up.

I think even if you insist on the same end-state it makes the patch
progression much *easier* to reason about. We'd then solve one problem
at a time, and start with a commit where just the semantics that are
unique to "bulk" are implemented, with nothing else conflated with
those.

> [...]
>> Ok, so it's something we could do, but passing down 2-3 functions to
>> object-file.c was a hassle.
>>
>> I tried to hack that up earlier and found that it wasn't *too
>> bad*. I.e. we'd pass some "flags" about our intent, and amend various
>> functions to take "don't close this one" and pass up the fd (or even do
>> that as a global).
>>
>> In any case, having the commit message clearly document what's needed
>> for what & what's essential & just shortcut taken for the convenience of
>> the current implementation would be really useful.
>>
>> Then we can always e.g. change this later to just do the fsync() on
>> the last of N we write.
>>
>
> I left a comment in the (now very long) commit message that indicates the
> dummy file is there to make the API simpler.

In terms of more understandable progression I also think this series
would be much easier to understand if it converted one caller without
needing the "cookie" where doing so is easy, e.g. the unpack-objects.c
caller where we're processing nr_objects, so we can just pass down a
flag to do the fsync() for i == nr_objects.

That'll then clearly show that the whole business of having the global
state on the side is just a replacement for passing down such a flag.
Neeraj Singh March 22, 2022, 8:05 p.m. UTC | #10
On Tue, Mar 22, 2022 at 2:29 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Mon, Mar 21 2022, Neeraj Singh wrote:
>
> > On Mon, Mar 21, 2022 at 1:37 PM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >>
> >> On Mon, Mar 21 2022, Neeraj Singh wrote:
> >>
> >> [Don't have time for a full reply, sorry, just something quick]
> >>
> >> > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
> >> > [...]
> >> >> So, my question is still why the temporary object dir migration part of
> >> >> this is needed.
> >> >>
> >> >> We are writing N loose object files, and we write those to temporary
> >> >> names already.
> >> >>
> >> >> AFAIKT we could do all of this by doing the same
> >> >> tmp/rename/sync_file_range dance on the main object store.
> >> >>
> >> >
> >> > Why not the main object store? We want to maintain the invariant that any
> >> > name in the main object store refers to a file that durably has the
> >> > correct contents.
> >> > If we do sync_file_range and then rename, and then crash, we now have a file
> >> > in the main object store with some SHA name, whose contents may or may not
> >> > match the SHA.  However, if we ensure an fsync happens before the rename,
> >> > a crash at any point will leave us either with no file in the main
> >> > object store or
> >> > with a file that is durable on the disk.
> >>
> >> Ah, I see.
> >>
> >> Why does that matter? If the "bulk" mode works as advertised we might
> >> have such a corrupt loose or pack file, but we won't have anything
> >> referring to it as far as reachability goes.
> >>
> >> I'm aware that the various code paths that handle OID writing don't deal
> >> too well with it in practice to say the least, which one can try with
> >> say:
> >>
> >>     $ echo foo | git hash-object -w --stdin
> >>     45b983be36b73c0788dc9cbcb76cbb80fc7bb057
> >>     $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057
> >>
> >> I.e. "fsck", "show" etc. will all scream bloody murder, and re-running
> >> that hash-object again even returns success (we see it's there
> >> already, and think it's OK).
> >>
> >
> > I was under the impression that in practice a corrupt loose object can
> > create persistent problems in the repo for future commands, since we
> > might not aggressively verify that an existing file with a certain OID
> > really is valid when adding a new instance of the data with the same
> > OID.
>
> Yes, it can. As the hash-object case shows we don't even check at all.
>
> For "incoming push" we *will* notice, but will just uselessly error
> out.
>
> I actually had some patches a while ago to turn off our own home-grown
> SHA-1 collision checking.
>
> It had the nice side effect of making it easier to recover from loose
> object corruption, since you could (re-)push the corrupted OID as a
> PACK, we wouldn't check (and die) on the bad loose object, and since we
> take a PACK over LOOSE we'd recover:
> https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/
>
> > If you don't have an fsync barrier before producing the final
> > content-addressable name, you can't reason about "this operation
> > happened before that operation," so it wouldn't really be valid to say
> > that "we won't have anything referring to it as far as reachability
> > goes."
>
> That's correct, but we're discussing a feature that *does have* that
> fsync barrier. So if we get an error while writing the loose objects
> before the "cookie" fsync we'll presumably error out. That'll then be
> followed by an fsync() of whatever makes the objects reachable.
>

Because we have a content-addressable store which generally trusts
its contents are valid (at least when adding new instances of the same
content), the mere existence of a loose-object with a certain name is
enough to make it "reachable" to future operations, even if there are
no other immediate ways to get to that object.

> > It's entirely possible that you'd have trees pointing to other trees
> > or blobs that aren't valid, since data writes can be durable in any
> > order. At this point, future attempts to add the same blobs or trees
> > might silently drop the updates.  I'm betting that's why
> > core.fsyncObjectFiles was added in the first place, since someone
> > observed severe persistent consequences for this form of corruption.
>
> Well, you can see Linus's original rant-as-documentation for why we
> added it :) I.e. the original git implementation made some heavy
> linux-FS assumption about the order of writes and an fsync() flushing
> any previous writes, which wasn't portable.
>
> >> But in any case, I think it would me much easier to both review and
> >> reason about this code if these concerns were split up.
> >>
> >> I.e. things that want no fsync at all (I'd think especially so) might
> >> want to have such updates serialized in this manner, and as Junio
> >> pointed out making these things inseparable as you've done creates API
> >> concerns & fallout that's got nothing to do with what we need for the
> >> performance gains of the bulk checkin fsyncing technique,
> >> e.g. concurrent "update-index" consumers not being able to assume
> >> reported objects exist as soon as they're reported.
> >>
> >
> > I want to explicitly not respond to this concern. I don't believe
> > this 100-line patch can be usefully split.
>
> Leaving "usefully" aside for a second (since that's subjective), it
> clearly "can". I just tried this on top of "seen":
>
>         diff --git a/bulk-checkin.c b/bulk-checkin.c
>         index a702e0ff203..9e994c4d6ae 100644
>         --- a/bulk-checkin.c
>         +++ b/bulk-checkin.c
>         @@ -9,15 +9,12 @@
>          #include "pack.h"
>          #include "strbuf.h"
>          #include "string-list.h"
>         -#include "tmp-objdir.h"
>          #include "packfile.h"
>          #include "object-store.h"
>
>          static int bulk_checkin_plugged;
>          static int needs_batch_fsync;
>
>         -static struct tmp_objdir *bulk_fsync_objdir;
>         -
>          static struct bulk_checkin_state {
>                 char *pack_tmp_name;
>                 struct hashfile *f;
>         @@ -110,11 +107,6 @@ static void do_batch_fsync(void)
>                         strbuf_release(&temp_path);
>                         needs_batch_fsync = 0;
>                 }
>         -
>         -       if (bulk_fsync_objdir) {
>         -               tmp_objdir_migrate(bulk_fsync_objdir);
>         -               bulk_fsync_objdir = NULL;
>         -       }
>          }
>
>          static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
>         @@ -321,7 +313,6 @@ void fsync_loose_object_bulk_checkin(int fd)
>                  */
>                 if (bulk_checkin_plugged &&
>                     git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
>         -               assert(bulk_fsync_objdir);
>                         if (!needs_batch_fsync)
>                                 needs_batch_fsync = 1;
>                 } else {
>         @@ -343,19 +334,6 @@ int index_bulk_checkin(struct object_id *oid,
>          void plug_bulk_checkin(void)
>          {
>                 assert(!bulk_checkin_plugged);
>         -
>         -       /*
>         -        * A temporary object directory is used to hold the files
>         -        * while they are not fsynced.
>         -        */
>         -       if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
>         -               bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
>         -               if (!bulk_fsync_objdir)
>         -                       die(_("Could not create temporary object directory for core.fsyncobjectfiles=batch"));
>         -
>         -               tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
>         -       }
>         -
>                 bulk_checkin_plugged = 1;
>          }
>
> And then tried running:
>
>     $ GIT_PERF_MAKE_OPTS='CFLAGS=-O3' ./run HEAD~ HEAD -- p3900-stash.sh
>
> And got:
>
>     Test                                              HEAD~              HEAD
>     --------------------------------------------------------------------------------------------
>     3900.2: stash 500 files (object_fsyncing=false)   0.56(0.08+0.09)    0.60(0.08+0.08) +7.1%
>     3900.4: stash 500 files (object_fsyncing=true)    14.50(0.07+0.15)   17.13(0.10+0.12) +18.1%
>     3900.6: stash 500 files (object_fsyncing=batch)   1.14(0.08+0.11)    1.03(0.08+0.10) -9.6%
>
> Now, I really don't trust that perf run to say anything except these
> being in the same ballpark, but it's clearly going to be a *bit* faster
> since we'll be doing fewer IOps.
>
> As to "usefully" I really do get what you're saying that you only find
> these useful when you combine the two because you'd like to have 100%
> safety, and that's fair enough.
>
> But since we are going to have a knob to turn off fsyncing entirely, and
> we have this "bulk" mode which requires you to carefully reason about
> your FS semantics to ascertain safety the performance/safety trade-off
> is clearly something that's useful to have tweaks for.
>
> And with "bulk" the concern about leaving behind stray corrupt objects
> is entirely orthogonal to concerns about losing a ref update, which is
> the main thing we're worried about.
>
> I also don't see how even if you're arguing that nobody would want one
> without the other because everyone who cares about "bulk" cares about
> this stray-corrupt-loose-but-no-ref-update case, how it has any business
> being tied up in the "bulk" mode as far as the implementation goes.
>
> That's because the same edge case is exposed by
> core.fsyncObjectFiles=false for those who are assuming the initial
> "ordered" semantics.
>
> I.e. if we're saying that before we write the ref we'd like to not
> expose the WIP objects in the primary object store because they're not
> fsync'd yet, how is that mode different than "bulk" if we crash while
> doing that operation (before the eventual fsync()).
>
> So I really think it's much better to split these concerns up.
>
> I think even if you insist on the same end-state it makes the patch
> progression much *easier* to reason about. We'd then solve one problem
> at a time, and start with a commit where just the semantics that are
> unique to "bulk" are implemented, with nothing else conflated with
> those.

On Windows, where we want to have a consistent ODB by default, I'm
adding a faster way to achieve that safety. No user is asking for a
world where we are doing half the work to make a consistent ODB but not
the other half.

This one patch works holistically to provide the full batch safety
feature, and splitting it into two patches (which in the new version
wouldn't be as clean as you've done it above) doesn't make the
correctness of the whole thing more reviewable. In fact it's less
reviewable, since the fsync and objdir migration are in two separate
patches and a future historian wouldn't get as clear a picture of the
whole mechanism.

> > [...]
> >> Ok, so it's something we could do, but passing down 2-3 functions to
> >> object-file.c was a hassle.
> >>
> >> I tried to hack that up earlier and found that it wasn't *too
> >> bad*. I.e. we'd pass some "flags" about our intent, and amend various
> >> functions to take "don't close this one" and pass up the fd (or even do
> >> that as a global).
> >>
> >> In any case, having the commit message clearly document what's needed
> >> for what & what's essential & just shortcut taken for the convenience of
> >> the current implementation would be really useful.
> >>
> >> Then we can always e.g. change this later to just do the fsync() on
> >> the last of N we write.
> >>
> >
> > I left a comment in the (now very long) commit message that indicates the
> > dummy file is there to make the API simpler.
>
> In terms of more understandable progression I also think this series
> would be much easier to understand if it converted one caller without
> needing the "cookie" where doing so is easy, e.g. the unpack-objects.c
> caller where we're processing nr_objects, so we can just pass down a
> flag to do the fsync() for i == nr_objects.
>
> That'll then clearly show that the whole business of having the global
> state on the side is just a replacement for passing down such a flag.

That seems appropriate for our mailing list discussion, but I don't see
how it helps the patch series, because we'd be doing work to fsync
the final object and then reversing that work when producing the final
end state, which uses the dummy file.

Thanks,
Neeraj
Ævar Arnfjörð Bjarmason March 23, 2022, 1:26 p.m. UTC | #11
On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
> [...]
> diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> index 889522956e4..a3798dfc334 100644
> --- a/Documentation/config/core.txt
> +++ b/Documentation/config/core.txt
> @@ -628,6 +628,13 @@ core.fsyncMethod::
>  * `writeout-only` issues pagecache writeback requests, but depending on the
>    filesystem and storage hardware, data added to the repository may not be
>    durable in the event of a system crash. This is the default mode on macOS.
> +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> +  updates in the disk writeback cache and then does a single full fsync of
> +  a dummy file to trigger the disk cache flush at the end of the operation.

I think adding a \n\n here would help make this more readable & break
the flow a bit. I.e. just add a "+" on its own line, followed by
"Currently...

> +  Currently `batch` mode only applies to loose-object files. Other repository
> +  data is made durable as if `fsync` was specified. This mode is expected to
> +  be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
> +  and on Windows for repos stored on NTFS or ReFS filesystems.
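Concretely, the suggested AsciiDoc list continuation (a sketch of the reviewer's suggestion, not the final wording) would render the hunk as:

```
* `batch` enables a mode that uses writeout-only flushes to stage multiple
  updates in the disk writeback cache and then does a single full fsync of
  a dummy file to trigger the disk cache flush at the end of the operation.
+
Currently `batch` mode only applies to loose-object files. Other repository
data is made durable as if `fsync` was specified. This mode is expected to
be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
and on Windows for repos stored on NTFS or ReFS filesystems.
```

The bare `+` line is AsciiDoc's list-continuation marker, which keeps the second paragraph attached to the `batch` bullet while breaking the visual flow.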
Neeraj Singh March 24, 2022, 2:04 a.m. UTC | #12
On Wed, Mar 23, 2022 at 6:27 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> > [...]
> > diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
> > index 889522956e4..a3798dfc334 100644
> > --- a/Documentation/config/core.txt
> > +++ b/Documentation/config/core.txt
> > @@ -628,6 +628,13 @@ core.fsyncMethod::
> >  * `writeout-only` issues pagecache writeback requests, but depending on the
> >    filesystem and storage hardware, data added to the repository may not be
> >    durable in the event of a system crash. This is the default mode on macOS.
> > +* `batch` enables a mode that uses writeout-only flushes to stage multiple
> > +  updates in the disk writeback cache and then does a single full fsync of
> > +  a dummy file to trigger the disk cache flush at the end of the operation.
>
> I think adding a \n\n here would help make this more readable & break
> the flow a bit. I.e. just add a "+" on its own line, followed by
> "Currently...
>
> > +  Currently `batch` mode only applies to loose-object files. Other repository
> > +  data is made durable as if `fsync` was specified. This mode is expected to
> > +  be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
> > +  and on Windows for repos stored on NTFS or ReFS filesystems.

Thanks, will fix.

Patch

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 889522956e4..a3798dfc334 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -628,6 +628,13 @@  core.fsyncMethod::
 * `writeout-only` issues pagecache writeback requests, but depending on the
   filesystem and storage hardware, data added to the repository may not be
   durable in the event of a system crash. This is the default mode on macOS.
+* `batch` enables a mode that uses writeout-only flushes to stage multiple
+  updates in the disk writeback cache and then does a single full fsync of
+  a dummy file to trigger the disk cache flush at the end of the operation.
+  Currently `batch` mode only applies to loose-object files. Other repository
+  data is made durable as if `fsync` was specified. This mode is expected to
+  be as safe as `fsync` on macOS for repos stored on HFS+ or APFS filesystems
+  and on Windows for repos stored on NTFS or ReFS filesystems.
 
 core.fsyncObjectFiles::
 	This boolean will enable 'fsync()' when writing object files.
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 93b1dc5138a..a702e0ff203 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -3,14 +3,20 @@ 
  */
 #include "cache.h"
 #include "bulk-checkin.h"
+#include "lockfile.h"
 #include "repository.h"
 #include "csum-file.h"
 #include "pack.h"
 #include "strbuf.h"
+#include "string-list.h"
+#include "tmp-objdir.h"
 #include "packfile.h"
 #include "object-store.h"
 
 static int bulk_checkin_plugged;
+static int needs_batch_fsync;
+
+static struct tmp_objdir *bulk_fsync_objdir;
 
 static struct bulk_checkin_state {
 	char *pack_tmp_name;
@@ -80,6 +86,37 @@  clear_exit:
 	reprepare_packed_git(the_repository);
 }
 
+/*
+ * Cleanup after batch-mode fsync_object_files.
+ */
+static void do_batch_fsync(void)
+{
+	/*
+	 * Issue a full hardware flush against a temporary file to ensure
+	 * that all objects are durable before any renames occur.  The code in
+	 * fsync_loose_object_bulk_checkin has already issued a writeout
+	 * request, but it has not flushed any writeback cache in the storage
+	 * hardware.
+	 */
+
+	if (needs_batch_fsync) {
+		struct strbuf temp_path = STRBUF_INIT;
+		struct tempfile *temp;
+
+		strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
+		temp = xmks_tempfile(temp_path.buf);
+		fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
+		delete_tempfile(&temp);
+		strbuf_release(&temp_path);
+		needs_batch_fsync = 0;
+	}
+
+	if (bulk_fsync_objdir) {
+		tmp_objdir_migrate(bulk_fsync_objdir);
+		bulk_fsync_objdir = NULL;
+	}
+}
+
 static int already_written(struct bulk_checkin_state *state, struct object_id *oid)
 {
 	int i;
@@ -274,6 +311,24 @@  static int deflate_to_pack(struct bulk_checkin_state *state,
 	return 0;
 }
 
+void fsync_loose_object_bulk_checkin(int fd)
+{
+	/*
+	 * If we have a plugged bulk checkin, we issue a call that
+	 * cleans the filesystem page cache but avoids a hardware flush
+	 * command. Later on we will issue a single hardware flush
+	 * as part of do_batch_fsync.
+	 */
+	if (bulk_checkin_plugged &&
+	    git_fsync(fd, FSYNC_WRITEOUT_ONLY) >= 0) {
+		assert(bulk_fsync_objdir);
+		if (!needs_batch_fsync)
+			needs_batch_fsync = 1;
+	} else {
+		fsync_or_die(fd, "loose object file");
+	}
+}
+
 int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags)
@@ -288,6 +343,19 @@  int index_bulk_checkin(struct object_id *oid,
 void plug_bulk_checkin(void)
 {
 	assert(!bulk_checkin_plugged);
+
+	/*
+	 * A temporary object directory is used to hold the files
+	 * while they are not fsynced.
+	 */
+	if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT)) {
+		bulk_fsync_objdir = tmp_objdir_create("bulk-fsync");
+		if (!bulk_fsync_objdir)
+			die(_("Could not create temporary object directory for core.fsyncMethod=batch"));
+
+		tmp_objdir_replace_primary_odb(bulk_fsync_objdir, 0);
+	}
+
 	bulk_checkin_plugged = 1;
 }
 
@@ -297,4 +365,6 @@  void unplug_bulk_checkin(void)
 	bulk_checkin_plugged = 0;
 	if (bulk_checkin_state.f)
 		finish_bulk_checkin(&bulk_checkin_state);
+
+	do_batch_fsync();
 }
diff --git a/bulk-checkin.h b/bulk-checkin.h
index b26f3dc3b74..08f292379b6 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -6,6 +6,8 @@ 
 
 #include "cache.h"
 
+void fsync_loose_object_bulk_checkin(int fd);
+
 int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags);
diff --git a/cache.h b/cache.h
index 3160bc1e489..d1ae51388c9 100644
--- a/cache.h
+++ b/cache.h
@@ -1040,7 +1040,8 @@  extern int use_fsync;
 
 enum fsync_method {
 	FSYNC_METHOD_FSYNC,
-	FSYNC_METHOD_WRITEOUT_ONLY
+	FSYNC_METHOD_WRITEOUT_ONLY,
+	FSYNC_METHOD_BATCH
 };
 
 extern enum fsync_method fsync_method;
@@ -1767,6 +1768,11 @@  void fsync_or_die(int fd, const char *);
 int fsync_component(enum fsync_component component, int fd);
 void fsync_component_or_die(enum fsync_component component, int fd, const char *msg);
 
+static inline int batch_fsync_enabled(enum fsync_component component)
+{
+	return (fsync_components & component) && (fsync_method == FSYNC_METHOD_BATCH);
+}
+
 ssize_t read_in_full(int fd, void *buf, size_t count);
 ssize_t write_in_full(int fd, const void *buf, size_t count);
 ssize_t pread_in_full(int fd, void *buf, size_t count, off_t offset);
diff --git a/config.c b/config.c
index 261ee7436e0..0b28f90de8b 100644
--- a/config.c
+++ b/config.c
@@ -1688,6 +1688,8 @@  static int git_default_core_config(const char *var, const char *value, void *cb)
 			fsync_method = FSYNC_METHOD_FSYNC;
 		else if (!strcmp(value, "writeout-only"))
 			fsync_method = FSYNC_METHOD_WRITEOUT_ONLY;
+		else if (!strcmp(value, "batch"))
+			fsync_method = FSYNC_METHOD_BATCH;
 		else
 			warning(_("ignoring unknown core.fsyncMethod value '%s'"), value);
 
diff --git a/object-file.c b/object-file.c
index 5258d9ed827..bdb0a38328f 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1895,6 +1895,8 @@  static void close_loose_object(int fd)
 
 	if (fsync_object_files > 0)
 		fsync_or_die(fd, "loose object file");
+	else if (batch_fsync_enabled(FSYNC_COMPONENT_LOOSE_OBJECT))
+		fsync_loose_object_bulk_checkin(fd);
 	else
 		fsync_component_or_die(FSYNC_COMPONENT_LOOSE_OBJECT, fd,
 				       "loose object file");