diff mbox series

[v2,3/7] update-index: use the bulk-checkin infrastructure

Message ID 54797dbc52060b7fa913642cd5266f7e159a5bc9.1647760561.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series core.fsyncmethod: add 'batch' mode for faster fsyncing of multiple objects | expand

Commit Message

Neeraj Singh (WINDOWS-SFS) March 20, 2022, 7:15 a.m. UTC
From: Neeraj Singh <neerajsi@microsoft.com>

The update-index functionality is used internally by 'git stash push' to
setup the internal stashed commit.

This change enables bulk-checkin for update-index infrastructure to
speed up adding new objects to the object database by leveraging the
batch fsync functionality.

There is some risk with this change, since under batch fsync, the object
files will be in a tmp-objdir until update-index is complete.  This
usage is unlikely, since any tool invoking update-index and expecting to
see objects would have to synchronize with the update-index process
after passing it a file path.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
---
 builtin/update-index.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Ævar Arnfjörð Bjarmason March 21, 2022, 3:01 p.m. UTC | #1
On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:

> From: Neeraj Singh <neerajsi@microsoft.com>
>
> The update-index functionality is used internally by 'git stash push' to
> setup the internal stashed commit.
>
> This change enables bulk-checkin for update-index infrastructure to
> speed up adding new objects to the object database by leveraging the
> batch fsync functionality.
>
> There is some risk with this change, since under batch fsync, the object
> files will be in a tmp-objdir until update-index is complete.  This
> usage is unlikely, since any tool invoking update-index and expecting to
> see objects would have to synchronize with the update-index process
> after passing it a file path.
>
> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> ---
>  builtin/update-index.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index 75d646377cc..38e9d7e88cb 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -5,6 +5,7 @@
>   */
>  #define USE_THE_INDEX_COMPATIBILITY_MACROS
>  #include "cache.h"
> +#include "bulk-checkin.h"
>  #include "config.h"
>  #include "lockfile.h"
>  #include "quote.h"
> @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>  
>  	the_index.updated_skipworktree = 1;
>  
> +	/* we might be adding many objects to the object database */
> +	plug_bulk_checkin();
> +

Shouldn't this be after parse_options_start()?
Junio C Hamano March 21, 2022, 5:50 p.m. UTC | #2
"Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index 75d646377cc..38e9d7e88cb 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -5,6 +5,7 @@
>   */
>  #define USE_THE_INDEX_COMPATIBILITY_MACROS
>  #include "cache.h"
> +#include "bulk-checkin.h"
>  #include "config.h"
>  #include "lockfile.h"
>  #include "quote.h"
> @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>  
>  	the_index.updated_skipworktree = 1;
>  
> +	/* we might be adding many objects to the object database */
> +	plug_bulk_checkin();
> +
>  	/*
>  	 * Custom copy of parse_options() because we want to handle
>  	 * filename arguments as they come.
> @@ -1190,6 +1194,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>  		strbuf_release(&buf);
>  	}
>  
> +	/* by now we must have added all of the new objects */
> +	unplug_bulk_checkin();

I understand read-from-stdin code path would be worth plugging, but
the list of paths on the command line?  How many of them would one
fit?

Of course, the feeder may be expecting for the objects to appear in
the object store as it feeds the paths and will be utterly broken by
this change, as you mentioned in the proposed log message.  The
existing plug/unplug will change the behaviour by making the objects
sent to the packfile available only after getting unplugged.  This
series makes it even worse by making loose objects also unavailable
until unplug is called.

So, it probably is safer and more sensible approach to introduce a
new command line option to allow the bulk checkin, and those who do
not care about the intermediate state to opt into the new feature.
Neeraj Singh March 21, 2022, 10:09 p.m. UTC | #3
On Mon, Mar 21, 2022 at 8:04 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@microsoft.com>
> >
> > The update-index functionality is used internally by 'git stash push' to
> > setup the internal stashed commit.
> >
> > This change enables bulk-checkin for update-index infrastructure to
> > speed up adding new objects to the object database by leveraging the
> > batch fsync functionality.
> >
> > There is some risk with this change, since under batch fsync, the object
> > files will be in a tmp-objdir until update-index is complete.  This
> > usage is unlikely, since any tool invoking update-index and expecting to
> > see objects would have to synchronize with the update-index process
> > after passing it a file path.
> >
> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
> > ---
> >  builtin/update-index.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/builtin/update-index.c b/builtin/update-index.c
> > index 75d646377cc..38e9d7e88cb 100644
> > --- a/builtin/update-index.c
> > +++ b/builtin/update-index.c
> > @@ -5,6 +5,7 @@
> >   */
> >  #define USE_THE_INDEX_COMPATIBILITY_MACROS
> >  #include "cache.h"
> > +#include "bulk-checkin.h"
> >  #include "config.h"
> >  #include "lockfile.h"
> >  #include "quote.h"
> > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >
> >       the_index.updated_skipworktree = 1;
> >
> > +     /* we might be adding many objects to the object database */
> > +     plug_bulk_checkin();
> > +
>
> Shouldn't this be after parse_options_start()?

Does it make a difference?  Especially if we do the object dir creation lazily?
Neeraj Singh March 21, 2022, 10:18 p.m. UTC | #4
On Mon, Mar 21, 2022 at 10:50 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Neeraj Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > diff --git a/builtin/update-index.c b/builtin/update-index.c
> > index 75d646377cc..38e9d7e88cb 100644
> > --- a/builtin/update-index.c
> > +++ b/builtin/update-index.c
> > @@ -5,6 +5,7 @@
> >   */
> >  #define USE_THE_INDEX_COMPATIBILITY_MACROS
> >  #include "cache.h"
> > +#include "bulk-checkin.h"
> >  #include "config.h"
> >  #include "lockfile.h"
> >  #include "quote.h"
> > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >
> >       the_index.updated_skipworktree = 1;
> >
> > +     /* we might be adding many objects to the object database */
> > +     plug_bulk_checkin();
> > +
> >       /*
> >        * Custom copy of parse_options() because we want to handle
> >        * filename arguments as they come.
> > @@ -1190,6 +1194,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
> >               strbuf_release(&buf);
> >       }
> >
> > +     /* by now we must have added all of the new objects */
> > +     unplug_bulk_checkin();
>
> I understand read-from-stdin code path would be worth plugging, but
> the list of paths on the command line?  How many of them would one
> fit?
>

do_reupdate could touch all the files in the index.  Also one can pass a
directory, and re-add all files under the directory.

> Of course, the feeder may be expecting for the objects to appear in
> the object store as it feeds the paths and will be utterly broken by
> this change, as you mentioned in the proposed log message.  The
> existing plug/unplug will change the behaviour by making the objects
> sent to the packfile available only after getting unplugged.  This
> series makes it even worse by making loose objects also unavailable
> until unplug is called.
>
> So, it probably is safer and more sensible approach to introduce a
> new command line option to allow the bulk checkin, and those who do
> not care about the intermediate state to opt into the new feature.
>

I don't believe this usage is likely today. How would the feeder know when
it can expect to find an object in the object directory after passing something
on stdin?  When fed via stdin, git-update-index will asynchronously add that
object to the object database, leaving no indication to the feeder of when it
actually happens, aside from it happening before the git-update-index process
terminates.  I used to have a comment here about the feeder being able to
parse the --verbose output to get feedback from git-update-index, which
would be quite tricky. I thought it was unnecessarily detailed.

Thanks,
Neeraj
Ævar Arnfjörð Bjarmason March 21, 2022, 11:16 p.m. UTC | #5
On Mon, Mar 21 2022, Neeraj Singh wrote:

> On Mon, Mar 21, 2022 at 8:04 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Sun, Mar 20 2022, Neeraj Singh via GitGitGadget wrote:
>>
>> > From: Neeraj Singh <neerajsi@microsoft.com>
>> >
>> > The update-index functionality is used internally by 'git stash push' to
>> > setup the internal stashed commit.
>> >
>> > This change enables bulk-checkin for update-index infrastructure to
>> > speed up adding new objects to the object database by leveraging the
>> > batch fsync functionality.
>> >
>> > There is some risk with this change, since under batch fsync, the object
>> > files will be in a tmp-objdir until update-index is complete.  This
>> > usage is unlikely, since any tool invoking update-index and expecting to
>> > see objects would have to synchronize with the update-index process
>> > after passing it a file path.
>> >
>> > Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
>> > ---
>> >  builtin/update-index.c | 6 ++++++
>> >  1 file changed, 6 insertions(+)
>> >
>> > diff --git a/builtin/update-index.c b/builtin/update-index.c
>> > index 75d646377cc..38e9d7e88cb 100644
>> > --- a/builtin/update-index.c
>> > +++ b/builtin/update-index.c
>> > @@ -5,6 +5,7 @@
>> >   */
>> >  #define USE_THE_INDEX_COMPATIBILITY_MACROS
>> >  #include "cache.h"
>> > +#include "bulk-checkin.h"
>> >  #include "config.h"
>> >  #include "lockfile.h"
>> >  #include "quote.h"
>> > @@ -1110,6 +1111,9 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>> >
>> >       the_index.updated_skipworktree = 1;
>> >
>> > +     /* we might be adding many objects to the object database */
>> > +     plug_bulk_checkin();
>> > +
>>
>> Shouldn't this be after parse_options_start()?
>
> Does it make a difference?  Especially if we do the object dir creation lazily?

I think it won't matter for the machine, but it helps with readability
to keep code like this as close to where it's used as possible.

Close enough and we'd also spot the other bug I mentioned here,
i.e. that we're setting this up where we're not writing objects at all
:)
diff mbox series

Patch

diff --git a/builtin/update-index.c b/builtin/update-index.c
index 75d646377cc..38e9d7e88cb 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -5,6 +5,7 @@ 
  */
 #define USE_THE_INDEX_COMPATIBILITY_MACROS
 #include "cache.h"
+#include "bulk-checkin.h"
 #include "config.h"
 #include "lockfile.h"
 #include "quote.h"
@@ -1110,6 +1111,9 @@  int cmd_update_index(int argc, const char **argv, const char *prefix)
 
 	the_index.updated_skipworktree = 1;
 
+	/* we might be adding many objects to the object database */
+	plug_bulk_checkin();
+
 	/*
 	 * Custom copy of parse_options() because we want to handle
 	 * filename arguments as they come.
@@ -1190,6 +1194,8 @@  int cmd_update_index(int argc, const char **argv, const char *prefix)
 		strbuf_release(&buf);
 	}
 
+	/* by now we must have added all of the new objects */
+	unplug_bulk_checkin();
 	if (split_index > 0) {
 		if (git_config_get_split_index() == 0)
 			warning(_("core.splitIndex is set to false; "