[v9,0/9] Implement a batched fsync option for core.fsyncObjectFiles

Message ID: pull.1076.v9.git.git.1637020263.gitgitgadget@gmail.com

Message

Neeraj K. Singh via GitGitGadget Nov. 15, 2021, 11:50 p.m. UTC
Thanks to everyone for review so far!

This series shares the base tmp-objdir patches with my merged version of
Elijah Newren's remerge-diff series at:
https://github.com/neerajsi-msft/git/tree/neerajsi/remerge-diff.

Changes between v8 and v9 [1]:

 * Rebased onto master at tag v2.34.0
 * Fixed a git-prune bug when trying to clean up multiple cruft directories.
 * Preserved the tmp-objdir around update_relative_gitdir, which is called by
   setup_work_tree through the chdir_notify mechanism.
 * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
   'true', 'false', and 'batch'. This makes using old and new versions of
   git with 'batch' mode a little trickier, but hopefully people will
   generally be moving forward in versions.
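
With the schema as it stands, opting in is a one-line config change
(values as described above):

    [core]
        fsyncObjectFiles = batch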

[1] See
https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@gmail.com/
[2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/

Changes between v7 and v8:

 * Dropped the tmp-objdir patch to avoid renaming in a quarantine/temporary
   objdir, as suggested by Jeff King. This wasn't a good idea because we
   don't really know that there's only a single reader/writer. Avoiding the
   rename was a relatively minor perf optimization so it's okay to drop.

 * Added disable_ref_updates logic (as a flag on the odb) which is set when
   we're in a quarantine or when a tmp objdir is active. I believe this
   roughly follows the strategy suggested by Jeff King.
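
   Roughly, the shape is as follows (a sketch of the idea based on the
   description above, not the exact patch text):

    /* object-store.h: the odb carries a new flag (sketch) */
    struct object_directory {
            /* ...existing fields... */
            /*
             * Ref updates are disabled while this odb is a quarantine
             * or temporary object directory.
             */
            int disable_ref_updates;
    };

    /* refs.c: ref transactions check the flag (sketch) */
    if (the_repository->objects->odb->disable_ref_updates)
            return error(_("ref updates disabled while tmp objdir is active"));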

Neeraj Singh (9):
  tmp-objdir: new API for creating temporary writable databases
  tmp-objdir: disable ref updates when replacing the primary odb
  bulk-checkin: rename 'state' variable and separate 'plugged' boolean
  core.fsyncobjectfiles: batched disk flushes
  core.fsyncobjectfiles: add windows support for batch mode
  update-index: use the bulk-checkin infrastructure
  unpack-objects: use the bulk-checkin infrastructure
  core.fsyncobjectfiles: tests for batch mode
  core.fsyncobjectfiles: performance tests for add and stash

 Documentation/config/core.txt       | 29 ++++++++--
 Makefile                            |  6 ++
 builtin/prune.c                     | 23 ++++++--
 builtin/receive-pack.c              |  2 +-
 builtin/unpack-objects.c            |  3 +
 builtin/update-index.c              |  6 ++
 bulk-checkin.c                      | 90 +++++++++++++++++++++++++----
 bulk-checkin.h                      |  2 +
 cache.h                             |  8 ++-
 compat/mingw.h                      |  3 +
 compat/win32/flush.c                | 28 +++++++++
 config.c                            |  7 ++-
 config.mak.uname                    |  3 +
 configure.ac                        |  8 +++
 contrib/buildsystems/CMakeLists.txt |  3 +-
 environment.c                       | 11 +++-
 git-compat-util.h                   |  7 +++
 object-file.c                       | 60 ++++++++++++++++++-
 object-store.h                      | 26 +++++++++
 object.c                            |  2 +-
 refs.c                              |  2 +-
 repository.c                        |  2 +
 repository.h                        |  1 +
 t/lib-unique-files.sh               | 36 ++++++++++++
 t/perf/p3700-add.sh                 | 43 ++++++++++++++
 t/perf/p3900-stash.sh               | 46 +++++++++++++++
 t/t3700-add.sh                      | 20 +++++++
 t/t3903-stash.sh                    | 14 +++++
 t/t5300-pack-object.sh              | 30 ++++++----
 tmp-objdir.c                        | 55 +++++++++++++++++-
 tmp-objdir.h                        | 29 +++++++++-
 wrapper.c                           | 48 +++++++++++++++
 write-or-die.c                      |  2 +-
 33 files changed, 608 insertions(+), 47 deletions(-)
 create mode 100644 compat/win32/flush.c
 create mode 100644 t/lib-unique-files.sh
 create mode 100755 t/perf/p3700-add.sh
 create mode 100755 t/perf/p3900-stash.sh


base-commit: cd3e606211bb1cf8bc57f7d76bab98cc17a150bc
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1076%2Fneerajsi-msft%2Fneerajsi%2Fbulk-fsync-object-files-v9
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1076/neerajsi-msft/neerajsi/bulk-fsync-object-files-v9
Pull-Request: https://github.com/git/git/pull/1076

Range-diff vs v8:

  1:  f03797fd80d !  1:  6b27afa60e0 tmp-objdir: new API for creating temporary writable databases
     @@ Commit message
          Based-on-patch-by: Elijah Newren <newren@gmail.com>
      
          Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
     +    Signed-off-by: Junio C Hamano <gitster@pobox.com>
      
       ## builtin/prune.c ##
      @@ builtin/prune.c: static int show_only;
     @@ builtin/prune.c: static int prune_tmp_file(const char *fullpath)
      +		if (show_only || verbose)
      +			printf("Removing stale temporary directory %s\n", fullpath);
      +		if (!show_only) {
     ++			strbuf_reset(&remove_dir_buf);
      +			strbuf_addstr(&remove_dir_buf, fullpath);
      +			remove_dir_recursively(&remove_dir_buf, 0);
      +		}
     @@ builtin/receive-pack.c: static const char *unpack(int err_fd, struct shallow_inf
       		if (err_fd > 0)
       			close(err_fd);
      
     + ## environment.c ##
     +@@
     + #include "commit.h"
     + #include "strvec.h"
     + #include "object-store.h"
     ++#include "tmp-objdir.h"
     + #include "chdir-notify.h"
     + #include "shallow.h"
     + 
     +@@ environment.c: static void update_relative_gitdir(const char *name,
     + 				   void *data)
     + {
     + 	char *path = reparent_relative_path(old_cwd, new_cwd, get_git_dir());
     ++	struct tmp_objdir *tmp_objdir = tmp_objdir_unapply_primary_odb();
     + 	trace_printf_key(&trace_setup_key,
     + 			 "setup: move $GIT_DIR to '%s'",
     + 			 path);
     ++
     + 	set_git_dir_1(path);
     ++	if (tmp_objdir)
     ++		tmp_objdir_reapply_primary_odb(tmp_objdir, old_cwd, new_cwd);
     + 	free(path);
     + }
     + 
     +
       ## object-file.c ##
      @@ object-file.c: void add_to_alternates_memory(const char *reference)
       			     '\n', NULL, 0);
     @@ object.c: struct raw_object_store *raw_object_store_new(void)
       	odb_clear_loose_cache(odb);
      
       ## tmp-objdir.c ##
     +@@
     + #include "cache.h"
     + #include "tmp-objdir.h"
     ++#include "chdir-notify.h"
     + #include "dir.h"
     + #include "sigchain.h"
     + #include "string-list.h"
      @@
       struct tmp_objdir {
       	struct strbuf path;
       	struct strvec env;
      +	struct object_directory *prev_odb;
     ++	int will_destroy;
       };
       
       /*
     @@ tmp-objdir.c: void tmp_objdir_add_as_alternate(const struct tmp_objdir *t)
      +	if (t->prev_odb)
      +		BUG("the primary object database is already replaced");
      +	t->prev_odb = set_temporary_primary_odb(t->path.buf, will_destroy);
     ++	t->will_destroy = will_destroy;
     ++}
     ++
     ++struct tmp_objdir *tmp_objdir_unapply_primary_odb(void)
     ++{
     ++	if (!the_tmp_objdir || !the_tmp_objdir->prev_odb)
     ++		return NULL;
     ++
     ++	restore_primary_odb(the_tmp_objdir->prev_odb, the_tmp_objdir->path.buf);
     ++	the_tmp_objdir->prev_odb = NULL;
     ++	return the_tmp_objdir;
     ++}
     ++
     ++void tmp_objdir_reapply_primary_odb(struct tmp_objdir *t, const char *old_cwd,
     ++		const char *new_cwd)
     ++{
     ++	char *path;
     ++
     ++	path = reparent_relative_path(old_cwd, new_cwd, t->path.buf);
     ++	strbuf_reset(&t->path);
     ++	strbuf_addstr(&t->path, path);
     ++	free(path);
     ++	tmp_objdir_replace_primary_odb(t, t->will_destroy);
      +}
      
       ## tmp-objdir.h ##
     @@ tmp-objdir.h: int tmp_objdir_destroy(struct tmp_objdir *);
      + * If will_destroy is nonzero, the object directory may not be migrated.
      + */
      +void tmp_objdir_replace_primary_odb(struct tmp_objdir *, int will_destroy);
     ++
     ++/*
     ++ * If the primary object database was replaced by a temporary object directory,
     ++ * restore it to its original value while keeping the directory contents around.
     ++ * Returns NULL if the primary object database was not replaced.
     ++ */
     ++struct tmp_objdir *tmp_objdir_unapply_primary_odb(void);
     ++
     ++/*
      ++ * Reapplies the former primary temporary object database, after potentially
     ++ * changing its relative path.
     ++ */
     ++void tmp_objdir_reapply_primary_odb(struct tmp_objdir *, const char *old_cwd,
     ++		const char *new_cwd);
     ++
      +
       #endif /* TMP_OBJDIR_H */
  2:  bc085137340 !  2:  71817cccfb9 tmp-objdir: disable ref updates when replacing the primary odb
     @@ Commit message
          Reported-by: Jeff King <peff@peff.net>
      
          Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
     +    Signed-off-by: Junio C Hamano <gitster@pobox.com>
      
       ## environment.c ##
      @@ environment.c: void setup_git_env(const char *git_dir)
  3:  9335646ed91 =  3:  8fd1ca4c00a bulk-checkin: rename 'state' variable and separate 'plugged' boolean
  4:  b9d3d874432 =  4:  e1747ce00af core.fsyncobjectfiles: batched disk flushes
  5:  8df32eaaa9a !  5:  951a559874e core.fsyncobjectfiles: add windows support for batch mode
     @@ config.mak.uname: endif
       		compat/win32/path-utils.o \
       		compat/win32/pthread.o compat/win32/syslog.o \
       		compat/win32/trace2_win32_process_info.o \
     -@@ config.mak.uname: ifneq (,$(findstring MINGW,$(uname_S)))
     +@@ config.mak.uname: ifeq ($(uname_S),MINGW)
       	COMPAT_CFLAGS += -DSTRIP_EXTENSION=\".exe\"
       	COMPAT_OBJS += compat/mingw.o compat/winansi.o \
       		compat/win32/trace2_win32_process_info.o \
  6:  15767270984 =  6:  4a40fd4a29a update-index: use the bulk-checkin infrastructure
  7:  e88bab809a2 =  7:  cfc6a347d08 unpack-objects: use the bulk-checkin infrastructure
  8:  811d6d31509 !  8:  270c24827d0 core.fsyncobjectfiles: tests for batch mode
     @@ t/lib-unique-files.sh (new)
      
       ## t/t3700-add.sh ##
      @@ t/t3700-add.sh: test_description='Test of git add, including the -- option.'
     - 
     + TEST_PASSES_SANITIZE_LEAK=true
       . ./test-lib.sh
       
      +. $TEST_DIRECTORY/lib-unique-files.sh
  9:  f4fa20f591e =  9:  12d99641f4c core.fsyncobjectfiles: performance tests for add and stash

Comments

Ævar Arnfjörð Bjarmason Nov. 16, 2021, 8:02 a.m. UTC | #1
On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:

>  * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
>    'true', 'false', and 'batch'. This makes using old and new versions of
>    git with 'batch' mode a little trickier, but hopefully people will
>    generally be moving forward in versions.
>
> [1] See
> https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@gmail.com/
> [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/

I really think leaving that in place is just being unnecessarily
cavalier. There are a lot of mixed-version environments where git is
deployed, and we almost never break the configuration in this way (I
think in the past always by mistake).

In this case it's easy to avoid it, and coming up with a less narrow
config model [1] seems like a good idea in any case to unify the
various outstanding work in this area.

More generally on this series, per the thread ending in [2] I really
don't get why we have code like this:
	
	@@ -503,10 +504,12 @@ static void unpack_all(void)
	 	if (!quiet)
	 		progress = start_progress(_("Unpacking objects"), nr_objects);
	 	CALLOC_ARRAY(obj_list, nr_objects);
	+	plug_bulk_checkin();
	 	for (i = 0; i < nr_objects; i++) {
	 		unpack_one(i);
	 		display_progress(progress, i + 1);
	 	}
	+	unplug_bulk_checkin();
	 	stop_progress(&progress);
	 
	 	if (delta_list)

As opposed to doing an fsync on the last object we're
processing. I.e. why do we need the step of intentionally making the
objects unavailable in the tmp-objdir, and creating a "cookie" file to
sync at the start/end, as opposed to fsyncing on the last file (which
we're writing out anyway).

1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/
Neeraj Singh Nov. 17, 2021, 7:06 a.m. UTC | #2
On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:
>
> >  * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
> >    'true', 'false', and 'batch'. This makes using old and new versions of
> >    git with 'batch' mode a little trickier, but hopefully people will
> >    generally be moving forward in versions.
> >
> > [1] See
> > https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@gmail.com/
> > [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/
>
> I really think leaving that in-place is just being unnecessarily
> cavalier. There's a lot of mixed-version environments where git is
> deployed in, and we almost never break the configuration in this way (I
> think in the past always by mistake).

> In this case it's easy to avoid it, and coming up with a less narrow
> config model[1] seems like a good idea in any case to unify the various
> outstanding work in this area.
>
> More generally on this series, per the thread ending in [2] I really

My primary goal in all of these changes is to move git-for-windows over
to a default of batch fsync so that it can get closer to other
platforms in the performance of 'git add' while still retaining the
same level of data integrity. I'm hoping that most end-users are just
sticking to defaults here.

I'm happy to change the configuration schema again if there's a
consensus from the Git community that backwards compatibility of the
configuration is actually important to someone.

Also, if we're doing a deeper rethink of the fsync configuration (as
prompted by this work and Eric Wong's and Patrick Steinhardt's work),
do we want to retain a mode where we fsync some parts of the
persistent repo data but not others? If we add fsyncing of the index
in addition to the refs, I believe we would have covered all of the
critical data structures that would be needed to find the data that a
user has added to the repo if they complete a series of git commands
and then experience a system crash.

> don't get why we have code like this:
>
>         @@ -503,10 +504,12 @@ static void unpack_all(void)
>                 if (!quiet)
>                         progress = start_progress(_("Unpacking objects"), nr_objects);
>                 CALLOC_ARRAY(obj_list, nr_objects);
>         +       plug_bulk_checkin();
>                 for (i = 0; i < nr_objects; i++) {
>                         unpack_one(i);
>                         display_progress(progress, i + 1);
>                 }
>         +       unplug_bulk_checkin();
>                 stop_progress(&progress);
>
>                 if (delta_list)
>
> As opposed to doing an fsync on the last object we're
> processing. I.e. why do we need the step of intentionally making the
> objects unavailable in the tmp-objdir, and creating a "cookie" file to
> sync at the start/end, as opposed to fsyncing on the last file (which
> we're writing out anyway).
>
> 1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/

It's important not to expose an object's final name until its contents
have been fsynced to disk. We want to ensure that wherever we crash, we
won't have a loose object that Git may later try to open where the
filename doesn't match the content hash. I believe it's okay for a
given OID to be missing, since a later command could recreate it, but
an object with a wrong hash looks like it would persist until we do a
git-fsck.

I thought about figuring out how to sync the last object rather than
some random "cookie" file, but it wasn't clear to me how I'd figure out
which object is actually last from library code in a way that doesn't
burden each command with somehow figuring out its last object and
communicating that. The 'cookie' approach seems to lead to a cleaner
interface for callers.
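
Schematically, the ordering batch mode relies on is (a sketch;
write_one_object() stands in for the real object-writing code, the
plug/unplug calls are the series' actual API):

    plug_bulk_checkin();            /* object writes land in a tmp objdir */
    for (i = 0; i < nr_objects; i++)
            write_one_object(i);    /* no fsync; final OID names stay
                                       hidden inside the tmp objdir */
    unplug_bulk_checkin();          /* fsync one "cookie" file to flush
                                       everything above, then migrate the
                                       objects to their final, visible
                                       names */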

Thanks,
Neeraj
Ævar Arnfjörð Bjarmason Nov. 17, 2021, 7:24 a.m. UTC | #3
On Tue, Nov 16 2021, Neeraj Singh wrote:

> On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:
>>
>> >  * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
>> >    'true', 'false', and 'batch'. This makes using old and new versions of
>> >    git with 'batch' mode a little trickier, but hopefully people will
>> >    generally be moving forward in versions.
>> >
>> > [1] See
>> > https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@gmail.com/
>> > [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/
>>
>> I really think leaving that in-place is just being unnecessarily
>> cavalier. There's a lot of mixed-version environments where git is
>> deployed in, and we almost never break the configuration in this way (I
>> think in the past always by mistake).
>
>> In this case it's easy to avoid it, and coming up with a less narrow
>> config model[1] seems like a good idea in any case to unify the various
>> outstanding work in this area.
>>
>> More generally on this series, per the thread ending in [2] I really
>
> My primary goal in all of these changes is to move git-for-windows over to
> a default of batch fsync so that it can get closer to other platforms
> in performance
> of 'git add' while still retaining the same level of data integrity.
> I'm hoping that
> most end-users are just sticking to defaults here.
>
> I'm happy to change the configuration schema again if there's a
> consensus from the Git
> community that backwards-compatibility of the configuration is
> actually important to someone.
>
> Also, if we're doing a deeper rethink of the fsync configuration (as
> prompted by this work and
> Eric Wong's and Patrick Steinhardts work), do we want to retain a mode
> where we fsync some
> parts of the persistent repo data but not others?  If we add fsyncing
> of the index in addition to the refs,
> I believe we would have covered all of the critical data structures
> that would be needed to find the
> data that a user has added to the repo if they complete a series of
> git commands and then experience
> a system crash.

Just talking about it is how we'll find consensus; maybe you & Junio
would like to keep it as-is. I don't see why we'd expose this bad edge
case in configuration handling to users when it's entirely avoidable,
and we're still in the design phase.

>> don't get why we have code like this:
>>
>>         @@ -503,10 +504,12 @@ static void unpack_all(void)
>>                 if (!quiet)
>>                         progress = start_progress(_("Unpacking objects"), nr_objects);
>>                 CALLOC_ARRAY(obj_list, nr_objects);
>>         +       plug_bulk_checkin();
>>                 for (i = 0; i < nr_objects; i++) {
>>                         unpack_one(i);
>>                         display_progress(progress, i + 1);
>>                 }
>>         +       unplug_bulk_checkin();
>>                 stop_progress(&progress);
>>
>>                 if (delta_list)
>>
>> As opposed to doing an fsync on the last object we're
>> processing. I.e. why do we need the step of intentionally making the
>> objects unavailable in the tmp-objdir, and creating a "cookie" file to
>> sync at the start/end, as opposed to fsyncing on the last file (which
>> we're writing out anyway).
>>
>> 1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/
>> 2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/
>
> It's important to not expose an object's final name until its contents
> have been fsynced
> to disk. We want to ensure that wherever we crash, we won't have a
> loose object that
> Git may later try to open where the filename doesn't match the content
> hash. I believe it's
> okay for a given OID to be missing, since a later command could
> recreate it, but an object
> with a wrong hash looks like it would persist until we do a git-fsck.

Yes, we handle that rather badly, as I mentioned in some other threads,
but not doing the fsync on the last object vs. a "cookie" file right
afterwards seems like a hail-mary at best, no?

> I thought about figuring out how to sync the last object rather than some random
> "cookie" file, but it wasn't clear to me how I'd figure out which
> object is actually last
> from library code in a way that doesn't burden each command with
> somehow figuring
> out its last object and communicating that. The 'cookie' approach
> seems to lead to a cleaner
> interface for callers.

The above quoted code is looping through nr_objects, isn't it? Can't a
"do fsync" flag be passed down to unpack_one() when we process the last
loose object?
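
I.e. something like this, where the extra parameter to unpack_one() is
hypothetical:

    for (i = 0; i < nr_objects; i++) {
            unpack_one(i, /* do_fsync? */ i == nr_objects - 1);
            display_progress(progress, i + 1);
    }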
Neeraj Singh Nov. 18, 2021, 5:03 a.m. UTC | #4
On Tue, Nov 16, 2021 at 11:28 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Tue, Nov 16 2021, Neeraj Singh wrote:
>
> > On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >>
> >> On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:
> >>
> >> >  * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
> >> >    'true', 'false', and 'batch'. This makes using old and new versions of
> >> >    git with 'batch' mode a little trickier, but hopefully people will
> >> >    generally be moving forward in versions.
> >> >
> >> > [1] See
> >> > https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@gmail.com/
> >> > [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/
> >>
> >> I really think leaving that in-place is just being unnecessarily
> >> cavalier. There's a lot of mixed-version environments where git is
> >> deployed in, and we almost never break the configuration in this way (I
> >> think in the past always by mistake).
> >
> >> In this case it's easy to avoid it, and coming up with a less narrow
> >> config model[1] seems like a good idea in any case to unify the various
> >> outstanding work in this area.
> >>
> >> More generally on this series, per the thread ending in [2] I really
> >
> > My primary goal in all of these changes is to move git-for-windows over to
> > a default of batch fsync so that it can get closer to other platforms
> > in performance
> > of 'git add' while still retaining the same level of data integrity.
> > I'm hoping that
> > most end-users are just sticking to defaults here.
> >
> > I'm happy to change the configuration schema again if there's a
> > consensus from the Git
> > community that backwards-compatibility of the configuration is
> > actually important to someone.
> >
> > Also, if we're doing a deeper rethink of the fsync configuration (as
> > prompted by this work and
> > Eric Wong's and Patrick Steinhardts work), do we want to retain a mode
> > where we fsync some
> > parts of the persistent repo data but not others?  If we add fsyncing
> > of the index in addition to the refs,
> > I believe we would have covered all of the critical data structures
> > that would be needed to find the
> > data that a user has added to the repo if they complete a series of
> > git commands and then experience
> > a system crash.
>
> Just talking about it is how we'll find consensus, maybe you & Junio
> would like to keep it as-is. I don't see why we'd expose this bad edge
> case in configuration handling to users when it's entirely avoidable,
> and we're still in the design phase.

After trying to figure out an implementation, I have a new proposal,
which I've shared on the other thread [1].

[1] https://lore.kernel.org/git/CANQDOdcdhfGtPg0PxpXQA5gQ4x9VknKDKCCi4HEB0Z1xgnjKzg@mail.gmail.com/

>
> >> don't get why we have code like this:
> >>
> >>         @@ -503,10 +504,12 @@ static void unpack_all(void)
> >>                 if (!quiet)
> >>                         progress = start_progress(_("Unpacking objects"), nr_objects);
> >>                 CALLOC_ARRAY(obj_list, nr_objects);
> >>         +       plug_bulk_checkin();
> >>                 for (i = 0; i < nr_objects; i++) {
> >>                         unpack_one(i);
> >>                         display_progress(progress, i + 1);
> >>                 }
> >>         +       unplug_bulk_checkin();
> >>                 stop_progress(&progress);
> >>
> >>                 if (delta_list)
> >>
> >> As opposed to doing an fsync on the last object we're
> >> processing. I.e. why do we need the step of intentionally making the
> >> objects unavailable in the tmp-objdir, and creating a "cookie" file to
> >> sync at the start/end, as opposed to fsyncing on the last file (which
> >> we're writing out anyway).
> >>
> >> 1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/
> >> 2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/
> >
> > It's important to not expose an object's final name until its contents
> > have been fsynced
> > to disk. We want to ensure that wherever we crash, we won't have a
> > loose object that
> > Git may later try to open where the filename doesn't match the content
> > hash. I believe it's
> > okay for a given OID to be missing, since a later command could
> > recreate it, but an object
> > with a wrong hash looks like it would persist until we do a git-fsck.
>
> Yes, we handle that rather badly, as I mentioned in some other threads,
> but not doing the fsync on the last object v.s. a "cookie" file right
> afterwards seems like a hail-mary at best, no?
>

I'm not quite grasping what you're saying here. Are you saying that
using a dummy file instead of one of the actual objects is less likely
to produce the desired outcome on actual filesystem implementations?

> > I thought about figuring out how to sync the last object rather than some random
> > "cookie" file, but it wasn't clear to me how I'd figure out which
> > object is actually last
> > from library code in a way that doesn't burden each command with
> > somehow figuring
> > out its last object and communicating that. The 'cookie' approach
> > seems to lead to a cleaner
> > interface for callers.
>
> The above quoted code is looping through nr_objects isn't it? Can't a
> "do fsync" be passed down to unpack_one() when we process the last loose
> object?

Are you proposing that we do something different for unpack_objects
versus update_index and git-add? I was hoping to keep all of the users
of the batch fsync functionality equivalent. For the git-add workflow
and update-index, we'd need to track the most recent file so that we
can go back and fsync it. I don't believe that syncing the last object
composes well with the existing implementation of those commands.
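
To illustrate the burden, each caller would need something like this (a
sketch using git's existing strbuf/xopen/fsync_or_die helpers; the
tracking itself is hypothetical and not in the series):

    static struct strbuf last_object_path = STRBUF_INIT;

    /* after every object write: */
    strbuf_reset(&last_object_path);
    strbuf_addstr(&last_object_path, path);

    /* at the very end, reopen that file just to fsync it: */
    fd = xopen(last_object_path.buf, O_RDONLY);
    fsync_or_die(fd, last_object_path.buf);
    close(fd);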

Thanks,
Neeraj
Ævar Arnfjörð Bjarmason Dec. 1, 2021, 2:15 p.m. UTC | #5
On Wed, Nov 17 2021, Neeraj Singh wrote:

[Very late reply, sorry]

> On Tue, Nov 16, 2021 at 11:28 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Tue, Nov 16 2021, Neeraj Singh wrote:
>>
>> > On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason
>> > <avarab@gmail.com> wrote:
>> >>
>> >>
>> >> On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:
>> >>
>> >> >  * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
>> >> >    'true', 'false', and 'batch'. This makes using old and new versions of
>> >> >    git with 'batch' mode a little trickier, but hopefully people will
>> >> >    generally be moving forward in versions.
>> >> >
>> >> > [1] See
>> >> > https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@gmail.com/
>> >> > [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/
>> >>
>> >> I really think leaving that in-place is just being unnecessarily
>> >> cavalier. There's a lot of mixed-version environments where git is
>> >> deployed in, and we almost never break the configuration in this way (I
>> >> think in the past always by mistake).
>> >
>> >> In this case it's easy to avoid it, and coming up with a less narrow
>> >> config model[1] seems like a good idea in any case to unify the various
>> >> outstanding work in this area.
>> >>
>> >> More generally on this series, per the thread ending in [2] I really
>> >
>> > My primary goal in all of these changes is to move git-for-windows over to
>> > a default of batch fsync so that it can get closer to other platforms
>> > in performance
>> > of 'git add' while still retaining the same level of data integrity.
>> > I'm hoping that
>> > most end-users are just sticking to defaults here.
>> >
>> > I'm happy to change the configuration schema again if there's a
>> > consensus from the Git
>> > community that backwards-compatibility of the configuration is
>> > actually important to someone.
>> >
>> > Also, if we're doing a deeper rethink of the fsync configuration (as
>> > prompted by this work and
>> > Eric Wong's and Patrick Steinhardts work), do we want to retain a mode
>> > where we fsync some
>> > parts of the persistent repo data but not others?  If we add fsyncing
>> > of the index in addition to the refs,
>> > I believe we would have covered all of the critical data structures
>> > that would be needed to find the
>> > data that a user has added to the repo if they complete a series of
>> > git commands and then experience
>> > a system crash.
>>
>> Just talking about it is how we'll find consensus, maybe you & Junio
>> would like to keep it as-is. I don't see why we'd expose this bad edge
>> case in configuration handling to users when it's entirely avoidable,
>> and we're still in the design phase.
>
> After trying to figure out an implementation, I have a new proposal,
> which I've shared on the other thread [1].
>
> [1] https://lore.kernel.org/git/CANQDOdcdhfGtPg0PxpXQA5gQ4x9VknKDKCCi4HEB0Z1xgnjKzg@mail.gmail.com/

This LGTM, or something simpler as Junio points out with his "too
fine-grained?" comment as a follow-up. I'm honestly quite apathetic
about what we end up with exactly as long as:

 1. We get the people who are adding these config settings to talk & see if they make
    sense in combination.

 2. We avoid the trap of hard dying on older versions.

>>
>> >> don't get why we have code like this:
>> >>
>> >>         @@ -503,10 +504,12 @@ static void unpack_all(void)
>> >>                 if (!quiet)
>> >>                         progress = start_progress(_("Unpacking objects"), nr_objects);
>> >>                 CALLOC_ARRAY(obj_list, nr_objects);
>> >>         +       plug_bulk_checkin();
>> >>                 for (i = 0; i < nr_objects; i++) {
>> >>                         unpack_one(i);
>> >>                         display_progress(progress, i + 1);
>> >>                 }
>> >>         +       unplug_bulk_checkin();
>> >>                 stop_progress(&progress);
>> >>
>> >>                 if (delta_list)
>> >>
>> >> As opposed to doing an fsync on the last object we're
>> >> processing. I.e. why do we need the step of intentionally making the
>> >> objects unavailable in the tmp-objdir, and creating a "cookie" file to
>> >> sync at the start/end, as opposed to fsyncing on the last file (which
>> >> we're writing out anyway).
>> >>
>> >> 1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/
>> >> 2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/
>> >
>> > It's important to not expose an object's final name until its contents
>> > have been fsynced
>> > to disk. We want to ensure that wherever we crash, we won't have a
>> > loose object that
>> > Git may later try to open where the filename doesn't match the content
>> > hash. I believe it's
>> > okay for a given OID to be missing, since a later command could
>> > recreate it, but an object
>> > with a wrong hash looks like it would persist until we do a git-fsck.
>>
>> Yes, we handle that rather badly, as I mentioned in some other threads,
>> but not doing the fsync on the last object v.s. a "cookie" file right
>> afterwards seems like a hail-mary at best, no?
>>
>
> I'm not quite grasping what you're saying here. Are you saying that
> using a dummy
> file instead of one of the actual objects is less likely to produce
> the desired outcome
> on actual filesystem implementations?

[...covered below...]

>> > I thought about figuring out how to sync the last object rather than some random
>> > "cookie" file, but it wasn't clear to me how I'd figure out which
>> > object is actually last
>> > from library code in a way that doesn't burden each command with
>> > somehow figuring
>> > out its last object and communicating that. The 'cookie' approach
>> > seems to lead to a cleaner
>> > interface for callers.
>>
>> The above quoted code is looping through nr_objects isn't it? Can't a
>> "do fsync" be passed down to unpack_one() when we process the last loose
>> object?
>
> Are you proposing that we do something different for unpack_objects
> versus update_index
> and git-add?  I was hoping to keep all of the users of the batch fsync
> functionality equivalent.
> For the git-add workflow and update-index, we'd need to track the most
> recent file so that we
> can go back and fsync it.  I don't believe that syncing the last
> object composes well with the existing
> implementation of those commands.

There are probably cases where we need the cookie. I just mean instead
of the API being (as seen above in the quoted part), pseudocode:

    # A
    bulk_checkin_start_make_cookie():
    n = 10
    for i in 1..n:
        write_nth(i, fsync: 0);
    bulk_checkin_end_commit_cookie();

To have it be:

    # B
    bulk_checkin_start(do_cookie: 0);
    n = 10
    for i in 1..n:
        write_nth(i, fsync: (i == n));
    bulk_checkin_end();

Or actually, presumably simpler as:

    # C
    all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
    end_fsync = bulk_checkin_mode() ? 1 : all_fsync;
    n = 10;
    for i in 1..n:
        write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);

I.e. maybe there are cases where you really do need "A", but we're
usually (always?) writing out N objects, and we usually know it at the
same point where you'd want the plug_bulk_checkin/unplug_bulk_checkin,
so just fsyncing the last object/file/ref/whatever means we don't need
the whole ceremony of the cookie file.

I don't mind it per-se, but "B" and "C" just seem a lot simpler,
particularly since, as those examples show, we'll presumably want to
pass down a "do fsync?" flag to these in general, and we even usually
have a display_progress() in there.

So doesn't just doing "B" or "C" eliminate the need for a cookie
entirely?

Another advantage of that is that you'll presumably want such tracking
anyway even for the case of "A".

Because as soon as you have say a batch operation of writing X objects
and Y refs you'd want to track this anyway. I.e. either only fsync() on
the ref write (particularly if there's just the one ref), or on the last
ref, or for each ref and no object syncs. So this (like "C", except for
the "do_batch" in the "end_fsync" case):

    # D
    do_batch = in_existing_bulk_checkin() ? 1 : 0;
    all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
    end_fsync = bulk_checkin_mode() ? do_batch : all_fsync;
    n = 10;
    for i in 1..n:
        write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);

I mean, usually we'd want the "all refs", I'm just thinking of a case
like "git fast-import" or other known-to-the-user batch operation.

Or, as in the case of my 4bc1fd6e394 (pack-objects: rename .idx files
into place after .bitmap files, 2021-09-09), we'd want to know that
we're writing all of, say, *.bitmap and *.rev, where we currently
fsync() all of them; write *.bitmap, *.rev and *.pack (not sure that
one is safe) without fsync(); and then only fsync (or that and the
in-place move) the *.idx.
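
Schematically, for that pack case (the write_*/rename_* names here are
stand-ins for the real pack-writing code, not its actual functions):

    write_pack_file();              /* *.pack,   no fsync */
    write_rev_index_file();         /* *.rev,    no fsync */
    write_bitmap_file();            /* *.bitmap, no fsync */
    fsync_or_die(idx_fd, idx_name); /* sync only the *.idx... */
    rename_idx_into_place();        /* ...whose move into place is what
                                       publishes the whole set */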
Ævar Arnfjörð Bjarmason March 9, 2022, 11:02 p.m. UTC | #6
On Wed, Dec 01 2021, Ævar Arnfjörð Bjarmason wrote:

> On Wed, Nov 17 2021, Neeraj Singh wrote:
>
> [Very late reply, sorry]
>
>> On Tue, Nov 16, 2021 at 11:28 PM Ævar Arnfjörð Bjarmason
>> <avarab@gmail.com> wrote:
>>>
>>>
>>> On Tue, Nov 16 2021, Neeraj Singh wrote:
>>>
>>> > On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason
>>> > <avarab@gmail.com> wrote:
>>> >>
>>> >>
>>> >> On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:
>>> >>
>>> >> >  * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
>>> >> >    'true', 'false', and 'batch'. This makes using old and new versions of
>>> >> >    git with 'batch' mode a little trickier, but hopefully people will
>>> >> >    generally be moving forward in versions.
>>> >> >
>>> >> > [1] See
>>> >> > https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@gmail.com/
>>> >> > [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/
>>> >>
>>> >> I really think leaving that in-place is just being unnecessarily
>>> >> cavalier. There's a lot of mixed-version environments where git is
>>> >> deployed in, and we almost never break the configuration in this way (I
>>> >> think in the past always by mistake).
>>> >
>>> >> In this case it's easy to avoid it, and coming up with a less narrow
>>> >> config model[1] seems like a good idea in any case to unify the various
>>> >> outstanding work in this area.
>>> >>
>>> >> More generally on this series, per the thread ending in [2] I really
>>> >
>>> > My primary goal in all of these changes is to move git-for-windows over to
>>> > a default of batch fsync so that it can get closer to other platforms
>>> > in performance
>>> > of 'git add' while still retaining the same level of data integrity.
>>> > I'm hoping that
>>> > most end-users are just sticking to defaults here.
>>> >
>>> > I'm happy to change the configuration schema again if there's a
>>> > consensus from the Git
>>> > community that backwards-compatibility of the configuration is
>>> > actually important to someone.
>>> >
>>> > Also, if we're doing a deeper rethink of the fsync configuration (as
>>> > prompted by this work and
>>> > Eric Wong's and Patrick Steinhardts work), do we want to retain a mode
>>> > where we fsync some
>>> > parts of the persistent repo data but not others?  If we add fsyncing
>>> > of the index in addition to the refs,
>>> > I believe we would have covered all of the critical data structures
>>> > that would be needed to find the
>>> > data that a user has added to the repo if they complete a series of
>>> > git commands and then experience
>>> > a system crash.
>>>
>>> Just talking about it is how we'll find consensus, maybe you & Junio
>>> would like to keep it as-is. I don't see why we'd expose this bad edge
>>> case in configuration handling to users when it's entirely avoidable,
>>> and we're still in the design phase.
>>
>> After trying to figure out an implementation, I have a new proposal,
>> which I've shared on the other thread [1].
>>
>> [1] https://lore.kernel.org/git/CANQDOdcdhfGtPg0PxpXQA5gQ4x9VknKDKCCi4HEB0Z1xgnjKzg@mail.gmail.com/
>
> This LGTM, or something simpler as Junio points out with his "too
> fine-grained?" comment as a follow-up. I'm honestly quite apathetic
> about what we end up with exactly as long as:
>
>  1. We get the people who are adding these config settings to talk & see if they make
>     sense in combination.
>
>  2. We avoid the trap of hard dying on older versions.
>
>>>
>>> >> don't get why we have code like this:
>>> >>
>>> >>         @@ -503,10 +504,12 @@ static void unpack_all(void)
>>> >>                 if (!quiet)
>>> >>                         progress = start_progress(_("Unpacking objects"), nr_objects);
>>> >>                 CALLOC_ARRAY(obj_list, nr_objects);
>>> >>         +       plug_bulk_checkin();
>>> >>                 for (i = 0; i < nr_objects; i++) {
>>> >>                         unpack_one(i);
>>> >>                         display_progress(progress, i + 1);
>>> >>                 }
>>> >>         +       unplug_bulk_checkin();
>>> >>                 stop_progress(&progress);
>>> >>
>>> >>                 if (delta_list)
>>> >>
>>> >> As opposed to doing an fsync on the last object we're
>>> >> processing. I.e. why do we need the step of intentionally making the
>>> >> objects unavailable in the tmp-objdir, and creating a "cookie" file to
>>> >> sync at the start/end, as opposed to fsyncing on the last file (which
>>> >> we're writing out anyway).
>>> >>
>>> >> 1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/
>>> >> 2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/
>>> >
>>> > It's important to not expose an object's final name until its contents
>>> > have been fsynced
>>> > to disk. We want to ensure that wherever we crash, we won't have a
>>> > loose object that
>>> > Git may later try to open where the filename doesn't match the content
>>> > hash. I believe it's
>>> > okay for a given OID to be missing, since a later command could
>>> > recreate it, but an object
>>> > with a wrong hash looks like it would persist until we do a git-fsck.
>>>
>>> Yes, we handle that rather badly, as I mentioned in some other threads,
>>> but not doing the fsync on the last object v.s. a "cookie" file right
>>> afterwards seems like a hail-mary at best, no?
>>>
>>
>> I'm not quite grasping what you're saying here. Are you saying that
>> using a dummy
>> file instead of one of the actual objects is less likely to produce
>> the desired outcome
>> on actual filesystem implementations?
>
> [...covered below...]
>
>>> > I thought about figuring out how to sync the last object rather than some random
>>> > "cookie" file, but it wasn't clear to me how I'd figure out which
>>> > object is actually last
>>> > from library code in a way that doesn't burden each command with
>>> > somehow figuring
>>> > out its last object and communicating that. The 'cookie' approach
>>> > seems to lead to a cleaner
>>> > interface for callers.
>>>
>>> The above quoted code is looping through nr_objects isn't it? Can't a
>>> "do fsync" be passed down to unpack_one() when we process the last loose
>>> object?
>>
>> Are you proposing that we do something different for unpack_objects
>> versus update_index
>> and git-add?  I was hoping to keep all of the users of the batch fsync
>> functionality equivalent.
>> For the git-add workflow and update-index, we'd need to track the most
>> recent file so that we
>> can go back and fsync it.  I don't believe that syncing the last
>> object composes well with the existing
>> implementation of those commands.
>
> There's probably cases where we need the cookie. I just mean instead of
> the API being (as seen above in the quoted part), pseudocode:
>
>     # A
>     bulk_checkin_start_make_cookie():
>     n = 10
>     for i in 1..n:
>         write_nth(i, fsync: 0);
>     bulk_checkin_end_commit_cookie();
>
> To have it be:
>
>     # B
>     bulk_checkin_start(do_cookie: 0);
>     n = 10
>     for i in 1..n:
>         write_nth(i, fsync: (i == n));
>     bulk_checkin_end();
>
> Or actually, presumably simpler as:
>
>     # C
>     all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
>     end_fsync = bulk_checkin_mode() ? 1 : all_fsync;
>     n = 10;
>     for i in 1..n:
>         write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);
>
> I.e. maybe there are cases where you really do need "A", but we're
> usually (always?) writing out N objects, and we usually know it at the
> same point where you'd want the plug_bulk_checkin/unplug_bulk_checkin,
> so just fsyncing the last object/file/ref/whatever means we don't need
> the whole ceremony of the cookie file.
>
> I don't mind it per-se, but "B" and "C" just seem a lot simpler,
> particulary since as those examples show we'll presumably want to pass
> down a "do fsync?" to these in general, and we even usually have a
> display_progress() in there.
>
> So doesn't just doing "B" or "C" eliminate the need for a cookie
> entirely?
>
> Another advantage of that is that you'll presumably want such tracking
> anyway even for the case of "A".
>
> Because as soon as you have say a batch operation of writing X objects
> and Y refs you'd want to track this anyway. I.e. either only fsync() on
> the ref write (particularly if there's just the one ref), or on the last
> ref, or for each ref and no object syncs. So this (like "C", except for
> the "do_batch" in the "end_fsync" case):
>
>     # D
>     do_batch = in_existing_bulk_checkin() ? 1 : 0;
>     all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
>     end_fsync = bulk_checkin_mode() ? do_batch : all_fsync;
>     n = 10;
>     for i in 1..n:
>         write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);
>
> I mean, usually we'd want the "all refs", I'm just thinking of a case
> like "git fast-import" or other known-to-the-user batch operation.
>
> Or, as in the case of my 4bc1fd6e394 (pack-objects: rename .idx files
> into place after .bitmap files, 2021-09-09) we'd want to know that we're
> writing all of say *.bitmap, *.rev where we currently fsync() all of
> them, write *.bitmap, *.rev and *.pack (not sure that one is safe)
> without fsync(), and then only fsync (or that and in-place move) the
> *.idx.

Replying to an old-ish E-Mail of mine with some more thoughts that came
to mind after [1] (another recently resurrected fsync() thread).

I wonder if there's another twist on the plan outlined in [2] that would
be both portable & efficient, i.e. the "slow" POSIX way to write files
A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush
N times.

And as we've discussed, doing it just on Z will implicitly flush A..Y on
common OS's in the wild, which we're taking advantage of here.

But aside from the rename() dance in [2], what do those OS's do if you
write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably
equivalently, in reverse order: Y..A)?

I'd think they'd be smart enough to know that they already implicitly
flushed that data when Z was flushed, and make those fsync()'s a
rather cheap noop.

But I don't know, hence the question.

If that's true then perhaps it's a path towards having our cake and
eating it too in some cases?

I.e. an FS that would flush A..Y if we flush Z would do so quickly and
reliably, whereas a FS that doesn't have such an optimization might be
just as slow for all of A..Y, but at least it'll be safe.
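
As a sketch, the experiment I have in mind is (plain POSIX calls; the
path/buf/len arrays are illustrative, and whether the second loop is
nearly free is exactly the open question):

    for (i = 0; i < n; i++) {
            fd[i] = xopen(path[i], O_WRONLY | O_CREAT | O_EXCL, 0444);
            write_in_full(fd[i], buf[i], len[i]); /* no fsync yet */
    }
    fsync(fd[n - 1]);               /* flush "Z"; on some FSs this
                                       implicitly flushes A..Y too */
    for (i = 0; i < n - 1; i++)
            fsync(fd[i]);           /* are these now cheap no-ops? */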

1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/
Neeraj Singh March 10, 2022, 1:16 a.m. UTC | #7
On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> Replying to an old-ish E-Mail of mine with some more thought that came
> to mind after[1] (another recently resurrected fsync() thread).
>
> I wonder if there's another twist on the plan outlined in [2] that would
> be both portable & efficient, i.e. the "slow" POSIX way to write files
> A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush
> N times.
>
> And as we've discussed, doing it just on Z will implicitly flush A..Y on
> common OS's in the wild, which we're taking advantage of here.
>
> But aside from the rename() dance in[2], what do those OS's do if you
> write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably
> equivalently, in reverse order: Y..A).
>
> I'd think they'd be smart enough to know that they already implicitly
> flushed that data since Z was flushend, and make those fsync()'s a
> rather cheap noop.
>
> But I don't know, hence the question.
>
> If that's true then perhaps it's a path towards having our cake and
> eating it too in some cases?
>
> I.e. an FS that would flush A..Y if we flush Z would do so quickly and
> reliably, whereas a FS that doesn't have such an optimization might be
> just as slow for all of A..Y, but at least it'll be safe.
>
> 1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/

The important angle here is that we need some way to indicate to the
OS what A..Y is before we fsync on Z.  I.e. the OS will cache any
writes in memory until some sync-ish operation is done on *that
specific file*.  Syncing just 'Z' with no sync operations on A..Y
doesn't indicate that A..Y would get written out.  Apparently the bad
old ext3 behavior was similar to what you're proposing where a sync on
'Z' would imply something about independent files.

Here's an interesting paper I recently came across that proposes the
interface we'd really want, 'syncv':
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1&type=pdf.

Thanks,
Neeraj
Ævar Arnfjörð Bjarmason March 10, 2022, 2:01 p.m. UTC | #8
On Wed, Mar 09 2022, Neeraj Singh wrote:

> On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>>
>> Replying to an old-ish E-Mail of mine with some more thought that came
>> to mind after[1] (another recently resurrected fsync() thread).
>>
>> I wonder if there's another twist on the plan outlined in [2] that would
>> be both portable & efficient, i.e. the "slow" POSIX way to write files
>> A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush
>> N times.
>>
>> And as we've discussed, doing it just on Z will implicitly flush A..Y on
>> common OS's in the wild, which we're taking advantage of here.
>>
>> But aside from the rename() dance in[2], what do those OS's do if you
>> write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably
>> equivalently, in reverse order: Y..A).
>>
>> I'd think they'd be smart enough to know that they already implicitly
>> flushed that data since Z was flushend, and make those fsync()'s a
>> rather cheap noop.
>>
>> But I don't know, hence the question.
>>
>> If that's true then perhaps it's a path towards having our cake and
>> eating it too in some cases?
>>
>> I.e. an FS that would flush A..Y if we flush Z would do so quickly and
>> reliably, whereas a FS that doesn't have such an optimization might be
>> just as slow for all of A..Y, but at least it'll be safe.
>>
>> 1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail.com/
>> 2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/
>
> The important angle here is that we need some way to indicate to the
> OS what A..Y is before we fsync on Z.  I.e. the OS will cache any
> writes in memory until some sync-ish operation is done on *that
> specific file*.  Syncing just 'Z' with no sync operations on A..Y
> doesn't indicate that A..Y would get written out.  Apparently the bad
> old ext3 behavior was similar to what you're proposing where a sync on
> 'Z' would imply something about independent files.

It's certainly starting to sound like I'm misunderstanding this whole
thing, but just to clarify again I'm talking about the sort of loops
mentioned upthread in my [1]. I.e. you have (to copy from that E-Mail):

    bulk_checkin_start_make_cookie():
    n = 10
    for i in 1..n:
        write_nth(i, fsync: 0);
    bulk_checkin_end_commit_cookie();

I.e. we have a "cookie" file in a given dir (where, in this example,
we'd also write files A..Z). I.e. we write:

    cookie
    {A..Z}
    cookie

And then only fsync() on the "cookie" at the end, which "flushes" the
A..Z updates on some FS's (again, all per my possibly-incorrect
understanding).

Which is why I proposed that in many/all cases we could do this,
i.e. just the same without the "cookie" file (which AFAICT isn't needed
per-se, but was just added to make the API a bit simpler in not needing
to modify the relevant loops):

    all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
    end_fsync = bulk_checkin_mode() ? 1 : all_fsync;
    n = 10;
    for i in 1..n:
        write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);

I.e. we don't pay the cost of the fsync() as we're in the loop, but just
for the last file, which "flushes" the rest.

So far all of that's a paraphrasing of existing exchanges, but what I
was wondering now in [2] is: if we add this to that last example above:

    for i in 1..n-1:
        fsync_nth(i)

Wouldn't those same OS's that are being clever about deferring the
syncing of A..Z as a "batch" be clever enough to turn that (re-)syncing
into a NOOP?

Of course in this case we'd need to keep the fd's open and be clever
about E[MN]FILE (i.e. "Too many open..."), or do an fsync() every Nth
for some reasonable Nth, e.g. somewhere in the 2^10..2^12 range.

But *if* this works it seems to me to be something we might be able to
enable when "core.fsyncObjectFiles" is configured on those systems.

I.e. the implicit assumption with that configuration was that if we
sync N loose objects and then update and fsync the ref, the FS would
queue up the ref update after the syncing of the loose objects.

This new "cookie" (or my suggested "fsync last of N") is basically
making the same assumption, just with the slight twist that some OSs/FSs
are known to behave like that on a per-subdir basis, no?

> Here's an interesting paper I recently came across that proposes the
> interface we'd really want, 'syncv':
> https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1&type=pdf.

1. https://lore.kernel.org/git/211201.864k7sbdjt.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/220310.86lexilo3d.gmgdl@evledraar.gmail.com/
Neeraj Singh March 10, 2022, 5:52 p.m. UTC | #9
On Thu, Mar 10, 2022 at 6:17 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Wed, Mar 09 2022, Neeraj Singh wrote:
>
> > On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >>
> >> Replying to an old-ish E-Mail of mine with some more thought that came
> >> to mind after[1] (another recently resurrected fsync() thread).
> >>
> >> I wonder if there's another twist on the plan outlined in [2] that would
> >> be both portable & efficient, i.e. the "slow" POSIX way to write files
> >> A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush
> >> N times.
> >>
> >> And as we've discussed, doing it just on Z will implicitly flush A..Y on
> >> common OS's in the wild, which we're taking advantage of here.
> >>
> >> But aside from the rename() dance in[2], what do those OS's do if you
> >> write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably
> >> equivalently, in reverse order: Y..A).
> >>
> >> I'd think they'd be smart enough to know that they already implicitly
> >> flushed that data since Z was flushend, and make those fsync()'s a
> >> rather cheap noop.
> >>
> >> But I don't know, hence the question.
> >>
> >> If that's true then perhaps it's a path towards having our cake and
> >> eating it too in some cases?
> >>
> >> I.e. an FS that would flush A..Y if we flush Z would do so quickly and
> >> reliably, whereas a FS that doesn't have such an optimization might be
> >> just as slow for all of A..Y, but at least it'll be safe.
> >>
> >> 1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail.com/
> >> 2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/
> >
> > The important angle here is that we need some way to indicate to the
> > OS what A..Y is before we fsync on Z.  I.e. the OS will cache any
> > writes in memory until some sync-ish operation is done on *that
> > specific file*.  Syncing just 'Z' with no sync operations on A..Y
> > doesn't indicate that A..Y would get written out.  Apparently the bad
> > old ext3 behavior was similar to what you're proposing where a sync on
> > 'Z' would imply something about independent files.
>
> It's certainly starting to sound like I'm misunderstanding this whole
> thing, but just to clarify again I'm talking about the sort of loops
> mentioned upthread in my [1]. I.e. you have (to copy from that E-Mail):
>
>     bulk_checkin_start_make_cookie():
>     n = 10
>     for i in 1..n:
>         write_nth(i, fsync: 0);
>     bulk_checkin_end_commit_cookie();
>
> I.e. we have a "cookie" file in a given dir (where, in this example,
> we'd also write files A..Z). I.e. we write:
>
>     cookie
>     {A..Z}
>     cookie
>
> And then only fsync() on the "cookie" at the end, which "flushes" the
> A..Z updates on some FS's (again, all per my possibly-incorrect
> understanding).
>
> Which is why I proposed that in many/all cases we could do this,
> i.e. just the same without the "cookie" file (which AFAICT isn't needed
> per-se, but was just added to make the API a bit simpler in not needing
> to modify the relevant loops):
>
>     all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
>     end_fsync = bulk_checkin_mode() ? 1 : all_fsync;
>     n = 10;
>     for i in 1..n:
>         write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);
>
> I.e. we don't pay the cost of the fsync() as we're in the loop, but just
> for the last file, which "flushes" the rest.
>
> So far all of that's a paraphrasing of existing exchanges, but what I
> was wondering now in[2] is if we add this to this last example above:
>
>     for i in 1..n-1:
>         fsync_nth(i)
>
> Wouldn't those same OS's that are being clever about deferring the
> syncing of A..Z as a "batch" be clever enough to turn that (re-)syncing
> into a NOOP?
>
> Of course in this case we'd need to keep the fd's open and be clever
> about E[MN]FILE (i.e. "Too many open..."), or do an fsync() every Nth
> for some reasonable Nth, e.g. somewhere in the 2^10..2^12 range.
>
> But *if* this works it seems to me to be something we might be able to
> enable when "core.fsyncObjectFiles" is configured on those systems.
>
> I.e. the implicit assumption with that configuration was that if we sync
> N loose objects and then update and fsync the ref that the FS would
> queue up the ref update after the syncing of the loose objects.
>
> This new "cookie" (or my suggested "fsync last of N") is basically
> making the same assumption, just with the slight twist that some OSs/FSs
> are known to behave like that on a per-subdir basis, no?
>
> > Here's an interesting paper I recently came across that proposes the
> > interface we'd really want, 'syncv':
> > https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1&type=pdf.
>
> 1. https://lore.kernel.org/git/211201.864k7sbdjt.gmgdl@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/220310.86lexilo3d.gmgdl@evledraar.gmail.com/

On the actual FS implementations in the three common OSes I'm familiar with
(macOS, Windows, Linux), each file has its own independent data caching in OS
memory.  Fsyncing one of them doesn't necessarily imply writing out the OS
cache for any other file.  Except, apparently, on ext3 in data=ordered mode,
but that FS is no longer common.  On Linux, we use sync_file_range to get the
OS to write the in-memory cache to the storage hardware, which is what makes
the data 'available' to fsync.

Now, we could consider an implementation where we call sync_file_range
without the wait flags (i.e. without SYNC_FILE_RANGE_WAIT_BEFORE and
SYNC_FILE_RANGE_WAIT_AFTER).  Then we could later fsync every file (or batch
of files), which might be more efficient if the OS coalesces the disk cache
flushes.  I expect that this method is less likely to give us the desired
performance on common Linux FSes, however.
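
A minimal sketch of that alternative (hypothetical helper names, Linux-only,
error handling omitted -- this is not this series' code, just the idea):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static void start_writeback(int fd)
    {
            /* Kick off writeback of the whole file without waiting. */
            sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    }

    static void flush_batch(const int *fds, size_t nr)
    {
            /*
             * Later, fsync each file; if the OS coalesces the disk
             * cache flushes, most of these should be cheap.
             */
            for (size_t i = 0; i < nr; i++)
                    fsync(fds[i]);
    }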

The macOS and Windows APIs are defined a bit differently from Linux.  In both
those OSes, we're actually calling fsync-equivalent APIs that are defined to
write back all the relevant data and metadata, just without the storage cache
flush.

So to summarize:
1. We need to do write(2) to get the data out of Git and into the OS
   filesystem cache.
2. We need some API (macOS-fsync, Windows-NtFlushBuffersFileEx,
   Linux-sync_file_range) to transfer the data per-file to the storage
   controller, but without flushing the storage controller.
3. We need some API (macOS-F_FULLFSYNC, Windows-NtFlushBuffersFile,
   Linux-fsync) to push the storage controller cache to durable media.  This
   only needs to be done once at the end to push out the data made available
   in step (2).
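
To make steps (2) and (3) concrete, here's a rough sketch (hypothetical
helper names; the Windows calls aren't libc, so they're only noted in
comments):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Step 2: per-file writeback to the storage controller. */
    static void writeback_one_file(int fd)
    {
    #if defined(__APPLE__)
            fsync(fd);      /* macOS fsync writes back, no disk cache flush */
    #elif defined(__linux__)
            sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
    #endif
            /* Windows: NtFlushBuffersFileEx() with the no-disk-flush flag. */
    }

    /* Step 3: one flush of the storage controller cache at the end. */
    static void flush_disk_cache(int fd)
    {
    #if defined(__APPLE__)
            fcntl(fd, F_FULLFSYNC); /* also flushes the drive's cache */
    #else
            fsync(fd);              /* Linux fsync flushes the drive cache */
    #endif
            /* Windows: NtFlushBuffersFile() / FlushFileBuffers(). */
    }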


Thanks,
Neeraj
Randall S. Becker March 10, 2022, 6:08 p.m. UTC | #10
On March 10, 2022 12:53 PM, Neeraj Singh wrote:
>On Thu, Mar 10, 2022 at 6:17 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>wrote:
>>
>>
>> On Wed, Mar 09 2022, Neeraj Singh wrote:
>>
>> > On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason
><avarab@gmail.com> wrote:
>> >>
>> >> Replying to an old-ish E-Mail of mine with some more thought that
>> >> came to mind after[1] (another recently resurrected fsync() thread).
>> >>
>> >> I wonder if there's another twist on the plan outlined in [2] that
>> >> would be both portable & efficient, i.e. the "slow" POSIX way to
>> >> write files A..Z is to open/write/close/fsync each one, so we'll
>> >> trigger a HW flush N times.
>> >>
>> >> And as we've discussed, doing it just on Z will implicitly flush
>> >> A..Y on common OS's in the wild, which we're taking advantage of here.
>> >>
>> >> But aside from the rename() dance in[2], what do those OS's do if
>> >> you write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or,
>> >> presumably equivalently, in reverse order: Y..A).
>> >>
>> >> I'd think they'd be smart enough to know that they already
>> >> implicitly flushed that data since Z was flushed, and make those
>> >> fsync()'s a rather cheap noop.
>> >>
>> >> But I don't know, hence the question.
>> >>
>> >> If that's true then perhaps it's a path towards having our cake and
>> >> eating it too in some cases?
>> >>
>> >> I.e. an FS that would flush A..Y if we flush Z would do so quickly
>> >> and reliably, whereas a FS that doesn't have such an optimization
>> >> might be just as slow for all of A..Y, but at least it'll be safe.
>> >>
>> >> 1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@evledraar.gmail.com/
>> >> 2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/
>> >
>> > The important angle here is that we need some way to indicate to the
>> > OS what A..Y is before we fsync on Z.  I.e. the OS will cache any
>> > writes in memory until some sync-ish operation is done on *that
>> > specific file*.  Syncing just 'Z' with no sync operations on A..Y
>> > doesn't indicate that A..Y would get written out.  Apparently the
>> > bad old ext3 behavior was similar to what you're proposing where a
>> > sync on 'Z' would imply something about independent files.
>>
>> It's certainly starting to sound like I'm misunderstanding this whole
>> thing, but just to clarify again I'm talking about the sort of loops
>> mentioned upthread in my [1]. I.e. you have (to copy from that E-Mail):
>>
>>     bulk_checkin_start_make_cookie():
>>     n = 10
>>     for i in 1..n:
>>         write_nth(i, fsync: 0);
>>     bulk_checkin_end_commit_cookie();
>>
>> I.e. we have a "cookie" file in a given dir (where, in this example,
>> we'd also write files A..Z). I.e. we write:
>>
>>     cookie
>>     {A..Z}
>>     cookie
>>
>> And then only fsync() on the "cookie" at the end, which "flushes" the
>> A..Z updates on some FS's (again, all per my possibly-incorrect
>> understanding).
>>
>> Which is why I proposed that in many/all cases we could do this, i.e.
>> just the same without the "cookie" file (which AFAICT isn't needed
>> per-se, but was just added to make the API a bit simpler in not
>> needing to modify the relevant loops):
>>
>>     all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
>>     end_fsync = bulk_checkin_mode() ? 1 : all_fsync;
>>     n = 10;
>>     for i in 1..n:
>>         write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);
>>
>> I.e. we don't pay the cost of the fsync() as we're in the loop, but
>> just for the last file, which "flushes" the rest.
>>
>> So far all of that's a paraphrasing of existing exchanges, but what I
>> was wondering now in[2] is if we add this to this last example above:
>>
>>     for i in 1..n-1:
>>         fsync_nth(i)
>>
>> Wouldn't those same OS's that are being clever about deferring the
>> syncing of A..Z as a "batch" be clever enough to turn that
>> (re-)syncing into a NOOP?
>>
>> Of course in this case we'd need to keep the fd's open and be clever
>> about E[MN]FILE (i.e. "Too many open..."), or do an fsync() every Nth
>> for some reasonable Nth, e.g. somewhere in the 2^10..2^12 range.
>>
>> But *if* this works it seems to me to be something we might be able to
>> enable when "core.fsyncObjectFiles" is configured on those systems.
>>
>> I.e. the implicit assumption with that configuration was that if we
>> sync N loose objects and then update and fsync the ref that the FS
>> would queue up the ref update after the syncing of the loose objects.
>>
>> This new "cookie" (or my suggested "fsync last of N") is basically
>> making the same assumption, just with the slight twist that some
>> OSs/FSs are known to behave like that on a per-subdir basis, no?
>>
>> > Here's an interesting paper I recently came across that proposes the
>> > interface we'd really want, 'syncv':
>> > https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1&type=pdf.
>>
>> 1. https://lore.kernel.org/git/211201.864k7sbdjt.gmgdl@evledraar.gmail.com/
>> 2. https://lore.kernel.org/git/220310.86lexilo3d.gmgdl@evledraar.gmail.com/
>
>On the actual FS implementations in the three common OSes I'm familiar with
>(macOS, Windows, Linux), each file has its own independent data caching in OS
>memory.  Fsyncing one of them doesn't necessarily imply writing out the OS cache
>for any other file.  Except, apparently, on ext3 in data=ordered mode, but that FS
>is no longer common.  On Linux, we use sync_file_range to get the OS to write the
>in-memory cache to the storage hardware, which is what makes the data
>'available' to fsync.
>
>Now, we could consider an implementation where we call sync_file_range
>without the wait flags (i.e. without SYNC_FILE_RANGE_WAIT_BEFORE and
>SYNC_FILE_RANGE_WAIT_AFTER). Then we could later fsync every file (or batch
>of files), which might be more efficient if the OS coalesces the disk cache flushes.  I
>expect that this method is less likely to give us the desired performance on
>common Linux FSes, however.
>
>The macOS and Windows APIs are defined a bit differently from Linux.  In both
>those OSes, we're actually calling fsync-equivalent APIs that are defined to
>write back all the relevant data and metadata, just without the storage cache
>flush.
>
>So to summarize:
>1. We need to do write(2) to get the data out of Git and into the OS
>   filesystem cache.
>2. We need some API (macOS-fsync, Windows-NtFlushBuffersFileEx,
>   Linux-sync_file_range) to transfer the data per-file to the storage
>   controller, but without flushing the storage controller.
>3. We need some API (macOS-F_FULLFSYNC, Windows-NtFlushBuffersFile,
>   Linux-fsync) to push the storage controller cache to durable media.  This
>   only needs to be done once at the end to push out the data made available
>   in step (2).

While this might not be a surprise, on some platforms fsync is a
thread-blocking operation.  On OSes with kernel threads, fsync can
potentially cause multiple processes to block (depending on the
implementation), particularly where an fd is shared across threads (and thus
processes), which may end up causing a deadlock.  We might need to keep an
eye out for this type of situation in the future and at least try to test
for it.  I cannot actually see a situation where this would occur in git,
but that does not mean it is impossible.  Food for thought.
--Randall
Neeraj Singh March 10, 2022, 6:43 p.m. UTC | #11
On Thu, Mar 10, 2022 at 10:08 AM <rsbecker@nexbridge.com> wrote:
> While this might not be a surprise, on some platforms fsync is a
> thread-blocking operation.  On OSes with kernel threads, fsync can
> potentially cause multiple processes to block (depending on the
> implementation), particularly where an fd is shared across threads (and
> thus processes), which may end up causing a deadlock.  We might need to
> keep an eye out for this type of situation in the future and at least try
> to test for it.  I cannot actually see a situation where this would occur
> in git, but that does not mean it is impossible.  Food for thought.
> --Randall

fsync is expected to block the calling thread until the underlying
data is durable.  Unless the OS somehow depends on the git process to
make progress before fsync can complete, there should be no deadlock,
since there would be no cycle in the waiting graph.  This could be a
problem for FUSE implementations that are backed by Git, but they
already have to deal with that possibility today and this patch series
doesn't change anything.
Randall S. Becker March 10, 2022, 6:48 p.m. UTC | #12
On March 10, 2022 1:43 PM, Neeraj Singh wrote:
>On Thu, Mar 10, 2022 at 10:08 AM <rsbecker@nexbridge.com> wrote:
>> While this might not be a surprise, on some platforms fsync is a
>> thread-blocking operation.  On OSes with kernel threads, fsync can
>> potentially cause multiple processes to block (depending on the
>> implementation), particularly where an fd is shared across threads (and
>> thus processes), which may end up causing a deadlock.  We might need to
>> keep an eye out for this type of situation in the future and at least try
>> to test for it.  I cannot actually see a situation where this would occur
>> in git, but that does not mean it is impossible.  Food for thought.
>> --Randall
>
>fsync is expected to block the calling thread until the underlying data is durable.
>Unless the OS somehow depends on the git process to make progress before
>fsync can complete, there should be no deadlock, since there would be no cycle in
>the waiting graph.  This could be a problem for FUSE implementations that are
>backed by Git, but they already have to deal with that possibility today and this
>patch series doesn't change anything.

That assumption is based on a specific threading model.  In cooperative
user-thread models, fsync is process-blocking.  While fsync is, by spec,
required to block the calling thread, there are no limitations on it
blocking everything else as well.  In some systems, an fsync can block the
entire file system.  Just pointing that out.