diff mbox series

[RFC,2/3] tmp-objdir: introduce `tmp_objdir_repack()`

Message ID 0f19c139ba9bb5105747f545038825d0c89f2e42.1699381371.git.me@ttaylorr.com (mailing list archive)
State New, archived
Series replay: implement support for writing new objects to a pack

Commit Message

Taylor Blau Nov. 7, 2023, 6:22 p.m. UTC
In the following commit, we will teach `git replay` how to write a pack
containing the set of new objects created as a result of the `replay`
operation.

Since `replay` needs to be able to see the object(s) written
from previous steps in order to replay each commit, the ODB transaction
may have multiple pending packs. Migrating multiple packs back into the
main object store has a couple of downsides:

  - It is error-prone to do so: each pack must be migrated in the
    correct order (with the ".idx" file staged last), and the set of
    packs themselves must be moved over in the correct order to avoid
    racy behavior.

  - It is a (potentially significant) performance degradation to migrate
    a large number of packs back into the main object store.

Introduce a new function that combines the set of all packs in the
temporary object store to produce a single pack which is the logical
concatenation of all packs created during that level of the ODB
transaction.
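
The ordering constraint from the first bullet above can be sketched as
follows. This is a minimal standalone illustration in plain C, not Git's
implementation: `migrate_rank()` and `sort_for_migration()` are
hypothetical helpers, and the rank values are assumptions chosen for the
example. The only property it tries to capture is that every ".idx" file
is moved after the ".pack" files, so that no reader sees an index whose
pack is not yet in place.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if `name` ends with `suffix`, 0 otherwise. */
static int has_suffix(const char *name, const char *suffix)
{
	size_t nlen = strlen(name), slen = strlen(suffix);

	return nlen >= slen && !strcmp(name + nlen - slen, suffix);
}

/*
 * Hypothetical ranking: files with a lower rank are migrated first.
 * The exact values are illustrative; what matters is ".idx" being last.
 */
static int migrate_rank(const char *name)
{
	if (strncmp(name, "pack-", 5))
		return 0; /* not a pack file: no ordering constraint */
	if (has_suffix(name, ".keep"))
		return 1;
	if (has_suffix(name, ".pack"))
		return 2;
	if (has_suffix(name, ".idx"))
		return 4;
	return 3; /* any other pack-* file goes before the index */
}

/* qsort() comparator over an array of `const char *` file names. */
static int cmp_migrate_order(const void *a, const void *b)
{
	const char *x = *(const char *const *)a;
	const char *y = *(const char *const *)b;
	int rx = migrate_rank(x), ry = migrate_rank(y);

	if (rx != ry)
		return rx - ry;
	return strcmp(x, y); /* stable, predictable order within a rank */
}

/* Sort `names` in place into the order in which they would be moved. */
static void sort_for_migration(const char **names, size_t n)
{
	assert(names != NULL || n == 0);
	if (n)
		qsort(names, n, sizeof(*names), cmp_migrate_order);
}
```

With several packs pending, every file has to pass through this kind of
ordering, which is part of why collapsing them into a single pack first
is attractive.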

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 tmp-objdir.c | 13 +++++++++++++
 tmp-objdir.h |  6 ++++++
 2 files changed, 19 insertions(+)

Comments

Patrick Steinhardt Nov. 8, 2023, 7:05 a.m. UTC | #1
On Tue, Nov 07, 2023 at 01:22:58PM -0500, Taylor Blau wrote:
> In the following commit, we will teach `git replay` how to write a pack
> containing the set of new objects created as a result of the `replay`
> operation.
> 
> Since `replay` needs to be able to see the object(s) written
> from previous steps in order to replay each commit, the ODB transaction
> may have multiple pending packs. Migrating multiple packs back into the
> main object store has a couple of downsides:
> 
>   - It is error-prone to do so: each pack must be migrated in the
>     correct order (with the ".idx" file staged last), and the set of
>     packs themselves must be moved over in the correct order to avoid
>     racy behavior.
> 
>   - It is a (potentially significant) performance degradation to migrate
>     a large number of packs back into the main object store.
> 
> Introduce a new function that combines the set of all packs in the
> temporary object store to produce a single pack which is the logical
> concatenation of all packs created during that level of the ODB
> transaction.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  tmp-objdir.c | 13 +++++++++++++
>  tmp-objdir.h |  6 ++++++
>  2 files changed, 19 insertions(+)
> 
> diff --git a/tmp-objdir.c b/tmp-objdir.c
> index 5f9074ad1c..ef53180b47 100644
> --- a/tmp-objdir.c
> +++ b/tmp-objdir.c
> @@ -12,6 +12,7 @@
>  #include "strvec.h"
>  #include "quote.h"
>  #include "object-store-ll.h"
> +#include "run-command.h"
>  
>  struct tmp_objdir {
>  	struct strbuf path;
> @@ -277,6 +278,18 @@ int tmp_objdir_migrate(struct tmp_objdir *t)
>  	return ret;
>  }
>  
> +int tmp_objdir_repack(struct tmp_objdir *t)
> +{
> +	struct child_process cmd = CHILD_PROCESS_INIT;
> +
> +	cmd.git_cmd = 1;
> +
> +	strvec_pushl(&cmd.args, "repack", "-a", "-d", "-k", "-l", NULL);
> +	strvec_pushv(&cmd.env, tmp_objdir_env(t));

I wonder what the performance of this repack would be like in a large
repository with many refs. Ideally, I would expect that the repacking
performance should scale with the number of objects we have written into
the temporary object directory. But in practice, the repack will need to
compute reachability and thus also scales with the size of the repo
itself, doesn't it?

Patrick

> +	return run_command(&cmd);
> +}
> +
>  const char **tmp_objdir_env(const struct tmp_objdir *t)
>  {
>  	if (!t)
> diff --git a/tmp-objdir.h b/tmp-objdir.h
> index 237d96b660..d00e3b3e27 100644
> --- a/tmp-objdir.h
> +++ b/tmp-objdir.h
> @@ -36,6 +36,12 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix);
>   */
>  const char **tmp_objdir_env(const struct tmp_objdir *);
>  
> +/*
> + * Combines all packs in the tmp_objdir into a single pack before migrating.
> + * Removes original pack(s) after installing the combined pack into place.
> + */
> +int tmp_objdir_repack(struct tmp_objdir *);
> +
>  /*
>   * Finalize a temporary object directory by migrating its objects into the main
>   * object database, removing the temporary directory, and freeing any
> -- 
> 2.42.0.446.g0b9ef90488
>

Taylor Blau Nov. 9, 2023, 7:26 p.m. UTC | #2
On Wed, Nov 08, 2023 at 08:05:46AM +0100, Patrick Steinhardt wrote:
> > @@ -277,6 +278,18 @@ int tmp_objdir_migrate(struct tmp_objdir *t)
> >  	return ret;
> >  }
> >
> > +int tmp_objdir_repack(struct tmp_objdir *t)
> > +{
> > +	struct child_process cmd = CHILD_PROCESS_INIT;
> > +
> > +	cmd.git_cmd = 1;
> > +
> > +	strvec_pushl(&cmd.args, "repack", "-a", "-d", "-k", "-l", NULL);
> > +	strvec_pushv(&cmd.env, tmp_objdir_env(t));
>
> I wonder what the performance of this repack would be like in a large
> repository with many refs. Ideally, I would expect that the repacking
> performance should scale with the number of objects we have written into
> the temporary object directory. But in practice, the repack will need to
> compute reachability and thus also scales with the size of the repo
> itself, doesn't it?

Good question. We definitely do not want to be doing an all-into-one
repack as a consequence of running 'git replay' in a large repository
with lots of refs, objects, or both.

But since we push the result of calling `tmp_objdir_env(t)` into the
environment of the child process, we are only repacking the objects in
the temporary directory, not the main object store.
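
To illustrate why the environment scopes the repack, here is a minimal
standalone sketch in plain C, not the code from this series. It assumes
that the environment returned by `tmp_objdir_env()` points
`GIT_OBJECT_DIRECTORY` at the temporary directory and lists the main
object store in `GIT_ALTERNATE_OBJECT_DIRECTORIES`, so that `repack -l`
treats only the temporary directory as local. The `struct repack_env`
type and the `build_repack_env()` helper are hypothetical names made up
for this example, and the real function may set additional variables.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the two "NAME=value" strings. */
struct repack_env {
	char object_dir[4096];
	char alternates[4096];
};

/*
 * Build the environment entries for the child process (assumed layout):
 *   - the temporary directory becomes the "local" object directory;
 *   - the main object store is only visible as an alternate.
 * Returns 0 on success, -1 if a path does not fit in the buffer.
 */
static int build_repack_env(struct repack_env *env,
			    const char *tmp_dir, const char *main_dir)
{
	int n;

	assert(env && tmp_dir && main_dir);
	memset(env, 0, sizeof(*env));

	n = snprintf(env->object_dir, sizeof(env->object_dir),
		     "GIT_OBJECT_DIRECTORY=%s", tmp_dir);
	if (n < 0 || (size_t)n >= sizeof(env->object_dir))
		return -1;

	n = snprintf(env->alternates, sizeof(env->alternates),
		     "GIT_ALTERNATE_OBJECT_DIRECTORIES=%s", main_dir);
	if (n < 0 || (size_t)n >= sizeof(env->alternates))
		return -1;

	return 0;
}
```

The two strings would then be handed to the child process that runs
`git repack -a -d -k -l`, much as the patch does with
`strvec_pushv(&cmd.env, ...)`.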

I have a test that verifies this: in a repository with some arbitrary
set of pre-existing packs, running 'replay' adds exactly one pack to
that set and leaves the pre-existing packs in place.

Thanks,
Taylor

Patch

diff --git a/tmp-objdir.c b/tmp-objdir.c
index 5f9074ad1c..ef53180b47 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -12,6 +12,7 @@ 
 #include "strvec.h"
 #include "quote.h"
 #include "object-store-ll.h"
+#include "run-command.h"
 
 struct tmp_objdir {
 	struct strbuf path;
@@ -277,6 +278,18 @@  int tmp_objdir_migrate(struct tmp_objdir *t)
 	return ret;
 }
 
+int tmp_objdir_repack(struct tmp_objdir *t)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+
+	cmd.git_cmd = 1;
+
+	strvec_pushl(&cmd.args, "repack", "-a", "-d", "-k", "-l", NULL);
+	strvec_pushv(&cmd.env, tmp_objdir_env(t));
+
+	return run_command(&cmd);
+}
+
 const char **tmp_objdir_env(const struct tmp_objdir *t)
 {
 	if (!t)
diff --git a/tmp-objdir.h b/tmp-objdir.h
index 237d96b660..d00e3b3e27 100644
--- a/tmp-objdir.h
+++ b/tmp-objdir.h
@@ -36,6 +36,12 @@  struct tmp_objdir *tmp_objdir_create(const char *prefix);
  */
 const char **tmp_objdir_env(const struct tmp_objdir *);
 
+/*
+ * Combines all packs in the tmp_objdir into a single pack before migrating.
+ * Removes original pack(s) after installing the combined pack into place.
+ */
+int tmp_objdir_repack(struct tmp_objdir *);
+
 /*
  * Finalize a temporary object directory by migrating its objects into the main
  * object database, removing the temporary directory, and freeing any