diff mbox series

[v5,01/14] bulk-checkin: rename 'state' variable and separate 'plugged' boolean

Message ID c7a2a7efe6d532fc7fce1352b1dfce640cc9f2f6.1648616734.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects | expand

Commit Message

Neeraj Singh (WINDOWS-SFS) March 30, 2022, 5:05 a.m. UTC
From: Neeraj Singh <neerajsi@microsoft.com>

This commit prepares for adding batch-fsync to the bulk-checkin
infrastructure.

The bulk-checkin infrastructure is currently used to batch up addition
of large blobs to a packfile. When a blob is larger than
big_file_threshold, we unconditionally add it to a pack. If bulk
checkins are 'plugged', we allow multiple large blobs to be added to a
single pack until we reach the packfile size limit; otherwise, we simply
make a new packfile for each large blob. The 'unplug' call tells us when
the series of blob additions is done so that we can finish the packfiles
and make their objects available to subsequent operations.

Stated another way, bulk-checkin allows callers to define a transaction
that adds multiple objects to the object database, where the object
database can optimize its internal operations within the transaction
boundary.

Batched fsync will fit into bulk-checkin by taking advantage of the
plug/unplug functionality to determine the appropriate time to fsync
and make newly-added objects available in the primary object database.

* Rename 'state' variable to 'bulk_checkin_state', since we will later
  be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
  find in the debugger, since the name is more unique.

* Move the 'plugged' data member of 'bulk_checkin_state' into a separate
  static variable. Doing this avoids resetting the variable in
  finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
  seem to unintentionally disable the plugging functionality the first
  time a new packfile must be created due to packfile size limits. While
  disabling the plugging state only results in suboptimal behavior for
  the current code, it would be fatal for the bulk-fsync functionality
  later in this patch series.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
---
 bulk-checkin.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

Comments

Junio C Hamano March 30, 2022, 5:11 p.m. UTC | #1
"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Batched fsync will fit into bulk-checkin by taking advantage of the
> plug/unplug functionality to determine the appropriate time to fsync
> and make newly-added objects available in the primary object database.
>
> * Rename 'state' variable to 'bulk_checkin_state', since we will later
>   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
>   find in the debugger, since the name is more unique.
>
> * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
>   static variable. Doing this avoids resetting the variable in
>   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
>   seem to unintentionally disable the plugging functionality the first
>   time a new packfile must be created due to packfile size limits. While
>   disabling the plugging state only results in suboptimal behavior for
>   the current code, it would be fatal for the bulk-fsync functionality
>   later in this patch series.

Paraphrasing to make sure I understand your reasoning here...

In the "plug and then perform as many changes to the repository and
finally unplug" flow, before or after this series, the "perform"
step in the middle is unaware of which "bulk_checkin_state" instance
is being used to keep track of what is done to optimize by deferring
some operations until the "unplug" time.  So bulk_checkin_state is
not there to allow us to create multiple instances of it, pass them
around to different sequences of "plug, perform, unplug".  Each of
its members is inherently a singleton, so in the extreme, we could
turn these members into separate file-scope global variables if we
wanted to.  The "plugged" bit happens to be the only one getting
ejected by this patch, because it is inconvenient to "clear" other
members otherwise.

Is that what is going on?

If it is, I am mildly opposed to the flow of thought, from at least
two reasons.  It makes it hard for the next developer to decide if
the new members they are adding should be in or out of the struct.

More importantly, I think the call of finish_bulk_checkin() we make
in deflate_to_pack() you found (and there may possibly other places
that we do so; I didn't check) may not appear to be a bug in the
original context, but it already is a bug.  And when we change the
semantics of plug-unplug to be more "transaction-like", it becomes a
more serious bug, as you said.

There is NO reason to end the ongoing transaction there inside the
while() loop that tries to limit the size of the packfile being
used.  We may want to flush the "packfile part", which may have been
almost synonymous to the entirety of bulk_checkin_state, but as you
found out, the "plugged" bit is *outside* the "packfile part", and
that makes it a bug to call finish_bulk_checkin() from there.

We should add a new function, flush_bulk_checking_packfile(), to
flush only the packfile part of the bulk_checkin_state without
affecting other things---the "plugged" bit is the only one in the
current code before this series, but it does not have to stay to be
so.  When you start plugging the loose ref transactions, you may
find it handy (this is me handwaving) to have a list of refs that
you may have to do something at "unplug" time kept in the struct,
and you do not want deflate_to_pack() affecting the ongoing
"plugged" ref operations by calling finish_bulk_checkin() and
reinitializing that list, for example.

And then we should examine existing calls to finish_bulk_checkin()
and replace the ones that should not be finishing, i.e. the ones
that wanted "flush" but called "finish".
Neeraj Singh March 30, 2022, 6:34 p.m. UTC | #2
On Wed, Mar 30, 2022 at 10:11 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Batched fsync will fit into bulk-checkin by taking advantage of the
> > plug/unplug functionality to determine the appropriate time to fsync
> > and make newly-added objects available in the primary object database.
> >
> > * Rename 'state' variable to 'bulk_checkin_state', since we will later
> >   be adding 'bulk_fsync_objdir'.  This also makes the variable easier to
> >   find in the debugger, since the name is more unique.
> >
> > * Move the 'plugged' data member of 'bulk_checkin_state' into a separate
> >   static variable. Doing this avoids resetting the variable in
> >   finish_bulk_checkin when zeroing the 'bulk_checkin_state'. As-is, we
> >   seem to unintentionally disable the plugging functionality the first
> >   time a new packfile must be created due to packfile size limits. While
> >   disabling the plugging state only results in suboptimal behavior for
> >   the current code, it would be fatal for the bulk-fsync functionality
> >   later in this patch series.
>
> Paraphrasing to make sure I understand your reasoning here...
>
> In the "plug and then perform as many changes to the repository and
> finally unplug" flow, before or after this series, the "perform"
> step in the middle is unaware of which "bulk_checkin_state" instance
> is being used to keep track of what is done to optimize by deferring
> some operations until the "unplug" time.  So bulk_checkin_state is
> not there to allow us to create multiple instances of it, pass them
> around to different sequences of "plug, perform, unplug".  Each of
> its members is inherently a singleton, so in the extreme, we could
> turn these members into separate file-scope global variables if we
> wanted to.  The "plugged" bit happens to be the only one getting
> ejected by this patch, because it is inconvenient to "clear" other
> members otherwise.
>
> Is that what is going on?
>

More or less.  The current state is all about creating a single
packfile for multiple large objects.  That packfile is a singleton
today (we could have an alternate implementation where there's a
separate packfile per thread in the future, so it's not inherent to
the API).  We want to do this if the top-level caller is okay with the
state being invisible until the "finish" call, and that is conveyed by
the "plugged" flag.

> If it is, I am mildly opposed to the flow of thought, from at least
> two reasons.  It makes it hard for the next developer to decide if
> the new members they are adding should be in or out of the struct.
>
> More importantly, I think the call of finish_bulk_checkin() we make
> in deflate_to_pack() you found (and there may possibly other places
> that we do so; I didn't check) may not appear to be a bug in the
> original context, but it already is a bug.  And when we change the
> semantics of plug-unplug to be more "transaction-like", it becomes a
> more serious bug, as you said.
>
> There is NO reason to end the ongoing transaction there inside the
> while() loop that tries to limit the size of the packfile being
> used.  We may want to flush the "packfile part", which may have been
> almost synonymous to the entirety of bulk_checkin_state, but as you
> found out, the "plugged" bit is *outside* the "packfile part", and
> that makes it a bug to call finish_bulk_checkin() from there.
>
> We should add a new function, flush_bulk_checking_packfile(), to
> flush only the packfile part of the bulk_checkin_state without
> affecting other things---the "plugged" bit is the only one in the
> current code before this series, but it does not have to stay to be
> so

I'm happy to rename the packfile related stuff to end with _packfile
to make it clear that all of that state and functionality is related
to batching of packfile additions.
So from this patch: s/bulk_checkin_state/bulk_checkin_packfile and
s/finish_bulk_checkin/finish_bulk_checkin_packfile.

My new state will be bulk_fsync_* (as it is already).  Any future
ODB-related state can go here too (I'm imagining a future
log-structured 'new objects pack' that we can append to for adding
small objects, similar to the bulk_checkin_packfile but allowing
appends from multiple git invocations).

> When you start plugging the loose ref transactions, you may
> find it handy (this is me handwaving) to have a list of refs that
> you may have to do something at "unplug" time kept in the struct,
> and you do not want deflate_to_pack() affecting the ongoing
> "plugged" ref operations by calling finish_bulk_checkin() and
> reinitializing that list, for example.
>

I don't believe ref transactions will go through this part of the
infrastructure.  Refs already have a good transaction system (that
partly inspired this rebranding, after I saw how Peter implemented
batch ref fsync).  I expect this area will remain all about the ODB as
a subsystem that can enlist in a larger repo->transaction.  So a
top-level Git command might initiate a repo transaction, which would
internally initiate an ODB transaction, index transaction, and ref
transaction. The repo transaction would support flushing each of the
subtransactions with an optimal number of fsyncs.

> And then we should examine existing calls to finish_bulk_checkin()
> and replace the ones that should not be finishing, i.e. the ones
> that wanted "flush" but called "finish".

Sure.  I can fix this, which will only change this file.  The only
case of "finishing" would be in unplug_bulk_checkin /
end_odb_transaction.

Thanks,
Neeraj
Junio C Hamano March 30, 2022, 8:24 p.m. UTC | #3
Neeraj Singh <nksingh85@gmail.com> writes:

>> We should add a new function, flush_bulk_checking_packfile(), to
>> flush only the packfile part of the bulk_checkin_state without
>> affecting other things---the "plugged" bit is the only one in the
>> current code before this series, but it does not have to stay to be
>> so
>
> I'm happy to rename the packfile related stuff to end with _packfile
> to make it clear that all of that state and functionality is related
> to batching of packfile additions.

I do not care about names, though.  If you took that I hinted any
such change, sorry about that.  _state is fine as-is.

I do care about not ejecting plugged out of the structure and
instead keeping them together, with proper way to flush the part
that deflate_to_pack() wants to flush, instead of abusing the
"finish".

Thanks.
Neeraj Singh March 31, 2022, 4:17 a.m. UTC | #4
On Wed, Mar 30, 2022 at 1:24 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> >> We should add a new function, flush_bulk_checking_packfile(), to
> >> flush only the packfile part of the bulk_checkin_state without
> >> affecting other things---the "plugged" bit is the only one in the
> >> current code before this series, but it does not have to stay to be
> >> so
> >
> > I'm happy to rename the packfile related stuff to end with _packfile
> > to make it clear that all of that state and functionality is related
> > to batching of packfile additions.
>
> I do not care about names, though.  If you took that I hinted any
> such change, sorry about that.  _state is fine as-is.
>
> I do care about not ejecting plugged out of the structure and
> instead keeping them together, with proper way to flush the part
> that deflate_to_pack() wants to flush, instead of abusing the
> "finish".
>
> Thanks.

Just to understand your feedback better, is it a problem to separate
the state of each separate "thing" under ODB transactions into
separate file-scope global(s)?  In this series I declared the fsync
state as completely separate from the packfile state.  That's why I
was thinking of it as more of a naming problem, since the remaining
state aside from the plugged boolean is entirely packfile related.

My argument in favor of having separate file-scoped variables for each
'pluggable thing' would be that future implementations can evolve
separately without authors first having to disentangle a single
struct.

Thanks,
Neeraj
Junio C Hamano March 31, 2022, 5:50 p.m. UTC | #5
Neeraj Singh <nksingh85@gmail.com> writes:

> Just to understand your feedback better, is it a problem to separate
> the state of each separate "thing" under ODB transactions into
> separate file-scope global(s)?  In this series I declared the fsync
> state as completely separate from the packfile state.  That's why I
> was thinking of it as more of a naming problem, since the remaining
> state aside from the plugged boolean is entirely packfile related.

Ahh, sorry, my mistake.

I somehow thought that you would be making the existing "struct
bulk_checkin_state" infrastructure to cover not just the object
store but much more, perhaps because I partly mistook the motivation
to rename the structure (thinking again about it, since "checkin" is
the act of adding new objects to the object database from outside
sources (either from the working tree using "git add" command, or
from other sources using unpack-objects), the original name was
already fine to signal that it was about the object database, and
the need to rename it sounded like we were going to do much more
than the object database behind my head).

> My argument in favor of having separate file-scoped variables for each
> 'pluggable thing' would be that future implementations can evolve
> separately without authors first having to disentangle a single
> struct.

That is fine.  Would the trigger to "plug" and "unplug" also be
independent?  As you said elsewhere, the part to harden refs can
piggyback on the existing ref-transaction infrastructure.  I do not
know offhand what things other than loose objects that want "plug"
and "unplug" semantics, but if there are, are we going to have type
specific begin- and end-transaction?

Thanks.
Neeraj Singh March 31, 2022, 7:08 p.m. UTC | #6
On Thu, Mar 31, 2022 at 10:50 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Neeraj Singh <nksingh85@gmail.com> writes:
>
> > Just to understand your feedback better, is it a problem to separate
> > the state of each separate "thing" under ODB transactions into
> > separate file-scope global(s)?  In this series I declared the fsync
> > state as completely separate from the packfile state.  That's why I
> > was thinking of it as more of a naming problem, since the remaining
> > state aside from the plugged boolean is entirely packfile related.
>
> Ahh, sorry, my mistake.
>
> I somehow thought that you would be making the existing "struct
> bulk_checkin_state" infrastructure to cover not just the object
> store but much more, perhaps because I partly mistook the motivation
> to rename the structure (thinking again about it, since "checkin" is
> the act of adding new objects to the object database from outside
> sources (either from the working tree using "git add" command, or
> from other sources using unpack-objects), the original name was
> already fine to signal that it was about the object database, and
> the need to rename it sounded like we were going to do much more
> than the object database behind my head).
>
> > My argument in favor of having separate file-scoped variables for each
> > 'pluggable thing' would be that future implementations can evolve
> > separately without authors first having to disentangle a single
> > struct.
>
> That is fine.  Would the trigger to "plug" and "unplug" also be
> independent?  As you said elsewhere, the part to harden refs can
> piggyback on the existing ref-transaction infrastructure.  I do not
> know offhand what things other than loose objects that want "plug"
> and "unplug" semantics, but if there are, are we going to have type
> specific begin- and end-transaction?
>

With regards to bulk-checkin.h, I believe for simplicity of interface
to the callers, there should be a single pair of APIs for plug or
unplug of the entire ODB regardless of what optimizations happen under
the covers.  For eventual repo-wide transactions, there should be a
single API to initiate a transaction and a single one to commit/abort
the transaction at the end.  We may still also want a flush API so
that we can make the repo state consistent prior to executing hooks or
doing something else where an outside process needs consistent repo
state.

Thanks,
Neeraj
diff mbox series

Patch

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6d6c37171c9..577b135e39c 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -10,9 +10,9 @@ 
 #include "packfile.h"
 #include "object-store.h"
 
-static struct bulk_checkin_state {
-	unsigned plugged:1;
+static int bulk_checkin_plugged;
 
+static struct bulk_checkin_state {
 	char *pack_tmp_name;
 	struct hashfile *f;
 	off_t offset;
@@ -21,7 +21,7 @@  static struct bulk_checkin_state {
 	struct pack_idx_entry **written;
 	uint32_t alloc_written;
 	uint32_t nr_written;
-} state;
+} bulk_checkin_state;
 
 static void finish_tmp_packfile(struct strbuf *basename,
 				const char *pack_tmp_name,
@@ -278,21 +278,23 @@  int index_bulk_checkin(struct object_id *oid,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags)
 {
-	int status = deflate_to_pack(&state, oid, fd, size, type,
+	int status = deflate_to_pack(&bulk_checkin_state, oid, fd, size, type,
 				     path, flags);
-	if (!state.plugged)
-		finish_bulk_checkin(&state);
+	if (!bulk_checkin_plugged)
+		finish_bulk_checkin(&bulk_checkin_state);
 	return status;
 }
 
 void plug_bulk_checkin(void)
 {
-	state.plugged = 1;
+	assert(!bulk_checkin_plugged);
+	bulk_checkin_plugged = 1;
 }
 
 void unplug_bulk_checkin(void)
 {
-	state.plugged = 0;
-	if (state.f)
-		finish_bulk_checkin(&state);
+	assert(bulk_checkin_plugged);
+	bulk_checkin_plugged = 0;
+	if (bulk_checkin_state.f)
+		finish_bulk_checkin(&bulk_checkin_state);
 }