diff mbox series

[v3,01/11] bulk-checkin: rebrand plug/unplug APIs as 'odb transactions'

Message ID 53261f0099d53524155464fe79d10f9605fe93aa.1648097906.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects | expand

Commit Message

Neeraj Singh (WINDOWS-SFS) March 24, 2022, 4:58 a.m. UTC
From: Neeraj Singh <neerajsi@microsoft.com>

Make it clearer in the naming and documentation of the plug_bulk_checkin
and unplug_bulk_checkin APIs that they can be thought of as
a "transaction" to optimize operations on the object database.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
---
 builtin/add.c  |  4 ++--
 bulk-checkin.c |  4 ++--
 bulk-checkin.h | 14 ++++++++++++--
 3 files changed, 16 insertions(+), 6 deletions(-)

Comments

Ævar Arnfjörð Bjarmason March 24, 2022, 4:10 p.m. UTC | #1
On Thu, Mar 24 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> Make it clearer in the naming and documentation of the plug_bulk_checkin
> and unplug_bulk_checkin APIs that they can be thought of as
> a "transaction" to optimize operations on the object database.
>
> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  builtin/add.c  |  4 ++--
>  bulk-checkin.c |  4 ++--
>  bulk-checkin.h | 14 ++++++++++++--
>  3 files changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/builtin/add.c b/builtin/add.c
> index 3ffb86a4338..9bf37ceae8e 100644
> --- a/builtin/add.c
> +++ b/builtin/add.c
> @@ -670,7 +670,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>  		string_list_clear(&only_match_skip_worktree, 0);
>  	}
>  
> -	plug_bulk_checkin();
> +	begin_odb_transaction();
>  
>  	if (add_renormalize)
>  		exit_status |= renormalize_tracked_files(&pathspec, flags);
> @@ -682,7 +682,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>  
>  	if (chmod_arg && pathspec.nr)
>  		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
> -	unplug_bulk_checkin();
> +	end_odb_transaction();

Aside from anything else we've (dis)agreed on, I found this part really
odd when hacking on my RFC-on-top, i.e. originally I (wrongly) thought
the plug_bulk_checkin() was something that originated with this series
which adds the "bulk" mode.

But no, on second inspection it's a thing Junio added a long time ago so
that in this case we "stream to N pack" where we'd otherwise add N loose
objects.

Which, and I think Junio brought this up in an earlier round, but I
didn't fully understand that at the time makes this whole thing quite
odd to me.

So first, shouldn't we add this begin_odb_transaction() as a new thing?
I.e. surely wanting to do that object target redirection within a given
begin/end "scope" should be orthagonal to how fsync() happens within
that "scope", though in this case that happens to correspond.

And secondly, per the commit message and comment when it was added in
(568508e7657 (bulk-checkin: replace fast-import based implementation,
2011-10-28)) is it something we need *for that purpose* with the series
to unpack-objects without malloc()ing the size of the blob[1].

And, if so and orthagonal to that: If we know how to either stream N
objects to a PACK (as fast-import does), *and* we now (or SOON) know how
to stream loose objects without using size(blob) amounts of memory,
doesn't the "optimize fsync()" rather want to make use of the
stream-to-pack approach?

I.e. have you tried for the caseses where we create say 1k objects for
"git stash" tried to stream those to a pack? How does that compare (both
with/without the fsync changes).

I.e. I do worry (also per [2]) that while the whole "bulk fsync" is neat
(and I think can use it in either case, to defer object syncs until the
"index" or "ref" sync, as my RFC does) I worry that we're adding a bunch
of configuration and complexity for something that:

 1. Ultimately isn't all that important, as already for part of it we
    can mostly configure it away. I.e. "git-unpack-objects" v.s. writing
    a pack, cf. transfer.unpackLimit)
 2. We don't have #1 for "add" and "update-index", but if we stream to
    packs there is there any remaining benefit in practice?

1. https://lore.kernel.org/git/cover-v11-0.8-00000000000-20220319T001411Z-avarab@gmail.com/
2. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/
Neeraj Singh March 24, 2022, 5:52 p.m. UTC | #2
On Thu, Mar 24, 2022 at 9:24 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Thu, Mar 24 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > Make it clearer in the naming and documentation of the plug_bulk_checkin
> > and unplug_bulk_checkin APIs that they can be thought of as
> > a "transaction" to optimize operations on the object database.
> >
> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> > ---
> >  builtin/add.c  |  4 ++--
> >  bulk-checkin.c |  4 ++--
> >  bulk-checkin.h | 14 ++++++++++++--
> >  3 files changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/builtin/add.c b/builtin/add.c
> > index 3ffb86a4338..9bf37ceae8e 100644
> > --- a/builtin/add.c
> > +++ b/builtin/add.c
> > @@ -670,7 +670,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> >               string_list_clear(&only_match_skip_worktree, 0);
> >       }
> >
> > -     plug_bulk_checkin();
> > +     begin_odb_transaction();
> >
> >       if (add_renormalize)
> >               exit_status |= renormalize_tracked_files(&pathspec, flags);
> > @@ -682,7 +682,7 @@ int cmd_add(int argc, const char **argv, const char *prefix)
> >
> >       if (chmod_arg && pathspec.nr)
> >               exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
> > -     unplug_bulk_checkin();
> > +     end_odb_transaction();
>
> Aside from anything else we've (dis)agreed on, I found this part really
> odd when hacking on my RFC-on-top, i.e. originally I (wrongly) thought
> the plug_bulk_checkin() was something that originated with this series
> which adds the "bulk" mode.
>
> But no, on second inspection it's a thing Junio added a long time ago so
> that in this case we "stream to N pack" where we'd otherwise add N loose
> objects.
>
> Which, and I think Junio brought this up in an earlier round, but I
> didn't fully understand that at the time makes this whole thing quite
> odd to me.
>
> So first, shouldn't we add this begin_odb_transaction() as a new thing?
> I.e. surely wanting to do that object target redirection within a given
> begin/end "scope" should be orthagonal to how fsync() happens within
> that "scope", though in this case that happens to correspond.
>
> And secondly, per the commit message and comment when it was added in
> (568508e7657 (bulk-checkin: replace fast-import based implementation,
> 2011-10-28)) is it something we need *for that purpose* with the series
> to unpack-objects without malloc()ing the size of the blob[1].
>

The original change seems to be about optimizing addition of
successive large blobs to the ODB when we know we have a large batch.
It's a batch-mode optimization for the ODB, similar to my patch
series, just targeting large blobs rather than small blobs/trees.  It
also has the same property that the added data is "invisible" until
the transaction ends.

> And, if so and orthagonal to that: If we know how to either stream N
> objects to a PACK (as fast-import does), *and* we now (or SOON) know how
> to stream loose objects without using size(blob) amounts of memory,
> doesn't the "optimize fsync()" rather want to make use of the
> stream-to-pack approach?
>
> I.e. have you tried for the caseses where we create say 1k objects for
> "git stash" tried to stream those to a pack? How does that compare (both
> with/without the fsync changes).
>
> I.e. I do worry (also per [2]) that while the whole "bulk fsync" is neat
> (and I think can use it in either case, to defer object syncs until the
> "index" or "ref" sync, as my RFC does) I worry that we're adding a bunch
> of configuration and complexity for something that:
>
>  1. Ultimately isn't all that important, as already for part of it we
>     can mostly configure it away. I.e. "git-unpack-objects" v.s. writing
>     a pack, cf. transfer.unpackLimit)
>  2. We don't have #1 for "add" and "update-index", but if we stream to
>     packs there is there any remaining benefit in practice?
>
> 1. https://lore.kernel.org/git/cover-v11-0.8-00000000000-20220319T001411Z-avarab@gmail.com/
> 2. https://lore.kernel.org/git/220323.86fsn8ohg8.gmgdl@evledraar.gmail.com/

Stream to pack is a good idea.  But I think we'd want a way to append
to the most recent pack so that we don't explode the number of packs,
which seems to impose a linear cost on ODB operations, at least to
load up the indexes.  I think this is orthogonal and we can always
change the meaning of batch mode to use a pack mechanism when such a
mechanism is ready.

Thanks,
Neeraj
diff mbox series

Patch

diff --git a/builtin/add.c b/builtin/add.c
index 3ffb86a4338..9bf37ceae8e 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -670,7 +670,7 @@  int cmd_add(int argc, const char **argv, const char *prefix)
 		string_list_clear(&only_match_skip_worktree, 0);
 	}
 
-	plug_bulk_checkin();
+	begin_odb_transaction();
 
 	if (add_renormalize)
 		exit_status |= renormalize_tracked_files(&pathspec, flags);
@@ -682,7 +682,7 @@  int cmd_add(int argc, const char **argv, const char *prefix)
 
 	if (chmod_arg && pathspec.nr)
 		exit_status |= chmod_pathspec(&pathspec, chmod_arg[0], show_only);
-	unplug_bulk_checkin();
+	end_odb_transaction();
 
 finish:
 	if (write_locked_index(&the_index, &lock_file,
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6d6c37171c9..a16ae3c629d 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -285,12 +285,12 @@  int index_bulk_checkin(struct object_id *oid,
 	return status;
 }
 
-void plug_bulk_checkin(void)
+void begin_odb_transaction(void)
 {
 	state.plugged = 1;
 }
 
-void unplug_bulk_checkin(void)
+void end_odb_transaction(void)
 {
 	state.plugged = 0;
 	if (state.f)
diff --git a/bulk-checkin.h b/bulk-checkin.h
index b26f3dc3b74..69a94422ac7 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -10,7 +10,17 @@  int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags);
 
-void plug_bulk_checkin(void);
-void unplug_bulk_checkin(void);
+/*
+ * Tell the object database to optimize for adding
+ * multiple objects. end_odb_transaction must be called
+ * to make new objects visible.
+ */
+void begin_odb_transaction(void);
+
+/*
+ * Tell the object database to make any objects from the
+ * current transaction visible.
+ */
+void end_odb_transaction(void);
 
 #endif