diff mbox series

[06/10] sparse-checkout: use oidset to prevent repeat blobs

Message ID fcf948bda7aebcc5f88c17f5b308b2ce0cc285f5.1588857462.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series In-tree sparse-checkout definitions | expand

Commit Message

Linus Arver via GitGitGadget May 7, 2020, 1:17 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

As we parse the in-tree config files that store the sparse.dir values
used to create an in-tree sparse-checkout definition, we can easily
avoid parsing the same file multiple times by using an oidset on those
blobs. We only parse if the oid is new to the oidset.

This is unlikely to have a major performance benefit right now, but will
be extremely important when we introduce the sparse.inherit options to
link multiple files in a directed graph. This oidset will prevent
infinite loops when cycles exist in that digraph, or exponential blowups
even in the case of a directed acyclic graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sparse-checkout.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

Comments

Elijah Newren May 20, 2020, 4:40 p.m. UTC | #1
On Thu, May 7, 2020 at 6:21 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> As we parse the in-tree config files that store the sparse.dir values
> used to create an in-tree sparse-checkout definition, we can easily
> avoid parsing the same file multiple times by using an oidset on those
> blobs. We only parse if the oid is new to the oidset.
>
> This is unlikely to have a major performance benefit right now, but will
> be extremely important when we introduce the sparse.inherit options to
> link multiple files in a directed graph. This oidset will prevent
> infinite loops when cycles exist in that digraph, or exponential blowups
> even in the case of a directed acyclic graph.

I'm still not sure if I like the idea of having a mirror dependency
structure separate from (and duplicative of) the build code; I'm still
mulling that over.

It's good that you've protected against infinite loops.

Is there any reason to prefer swallowing infinite loops rather than
warning or flagging as an error?  (I'm not sure, just thinking my
questions out loud.)

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  sparse-checkout.c | 24 +++++++++++++++++++-----
>  1 file changed, 19 insertions(+), 5 deletions(-)
>
> diff --git a/sparse-checkout.c b/sparse-checkout.c
> index 6c58fda9722..d01f4d7b525 100644
> --- a/sparse-checkout.c
> +++ b/sparse-checkout.c
> @@ -9,6 +9,7 @@
>  #include "string-list.h"
>  #include "unpack-trees.h"
>  #include "object-store.h"
> +#include "oidset.h"
>
>  char *get_sparse_checkout_filename(void)
>  {
> @@ -77,9 +78,12 @@ int load_in_tree_pattern_list(struct repository *r,
>                               struct string_list *sl,
>                               struct pattern_list *pl)
>  {
> +       int result = 0;
>         struct index_state *istate = r->index;
>         struct string_list_item *item;
>         struct strbuf path = STRBUF_INIT;
> +       struct oidset set;
> +       oidset_init(&set, 16);
>
>         pl->use_cone_patterns = 1;
>
> @@ -96,24 +100,34 @@ int load_in_tree_pattern_list(struct repository *r,
>                  * Use -1 return to ensure populate_from_existing_patterns()
>                  * skips the sparse-checkout updates.
>                  */
> -               if (pos < 0)
> -                       return -1;
> +               if (pos < 0) {
> +                       result = -1;
> +                       goto cleanup;
> +               }
>
>                 oid = &istate->cache[pos]->oid;
> +
> +               if (oidset_contains(&set, oid))
> +                       continue;
> +
> +               oidset_insert(&set, oid);
> +
>                 type = oid_object_info(r, oid, NULL);
>
>                 if (type != OBJ_BLOB) {
>                         warning(_("expected a file at '%s'; not updating sparse-checkout"),
>                                 oid_to_hex(oid));
> -                       return 1;
> +                       result = 1;
> +                       goto cleanup;
>                 }
>
>                 load_in_tree_from_blob(pl, oid);
>         }
>
> +cleanup:
>         strbuf_release(&path);
> -
> -       return 0;
> +       oidset_clear(&set);
> +       return result;
>  }
>
>  int populate_sparse_checkout_patterns(struct pattern_list *pl)
> --
> gitgitgadget
>
Elijah Newren May 21, 2020, 3:49 a.m. UTC | #2
Replying to my own questions...

On Wed, May 20, 2020 at 9:40 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, May 7, 2020 at 6:21 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > As we parse the in-tree config files that store the sparse.dir values
> > used to create an in-tree sparse-checkout definition, we can easily
> > avoid parsing the same file multiple times by using an oidset on those
> > blobs. We only parse if the oid is new to the oidset.
> >
> > This is unlikely to have a major performance benefit right now, but will
> > be extremely important when we introduce the sparse.inherit options to
> > link multiple files in a directed graph. This oidset will prevent
> > infinite loops when cycles exist in that digraph, or exponential blowups
> > even in the case of a directed acyclic graph.
>
> I'm still not sure if I like the idea of having a mirror dependency
> structure separate from (and duplicative of) the build code; I'm still
> mulling that over.

I mentioned this to a few other buildsystem folks at $DAYJOB.  They
were strongly opposed to having more than one source of truth, but
generating the git in-tree sparse values from the official build
system files, with commit hooks and build system checks to make sure
they get updated seemed like it'd be fine or not concern them much.

> It's good that you've protected against infinite loops.
>
> Is there any reason to prefer swallowing infinite loops rather than
> warning or flagging as an error?  (I'm not sure, just thinking my
> questions out loud.)

The buildsystem folks also reminded me that we have cylic dependencies
already, and although it's absolutely ugly, it is somewhat forced on
us by a combination of different external tools that we can't change.
As such, warnings or errors would be really annoying and we'd be one
of the ones to want to turn them off.  So drop that idea from me.  :-)
Derrick Stolee May 21, 2020, 5:54 p.m. UTC | #3
On 5/20/2020 11:49 PM, Elijah Newren wrote:
> Replying to my own questions...
> 
> On Wed, May 20, 2020 at 9:40 AM Elijah Newren <newren@gmail.com> wrote:
>>
>> On Thu, May 7, 2020 at 6:21 AM Derrick Stolee via GitGitGadget
>> <gitgitgadget@gmail.com> wrote:
>>>
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> As we parse the in-tree config files that store the sparse.dir values
>>> used to create an in-tree sparse-checkout definition, we can easily
>>> avoid parsing the same file multiple times by using an oidset on those
>>> blobs. We only parse if the oid is new to the oidset.
>>>
>>> This is unlikely to have a major performance benefit right now, but will
>>> be extremely important when we introduce the sparse.inherit options to
>>> link multiple files in a directed graph. This oidset will prevent
>>> infinite loops when cycles exist in that digraph, or exponential blowups
>>> even in the case of a directed acyclic graph.
>>
>> I'm still not sure if I like the idea of having a mirror dependency
>> structure separate from (and duplicative of) the build code; I'm still
>> mulling that over.
> 
> I mentioned this to a few other buildsystem folks at $DAYJOB.  They
> were strongly opposed to having more than one source of truth, but
> generating the git in-tree sparse values from the official build
> system files, with commit hooks and build system checks to make sure
> they get updated seemed like it'd be fine or not concern them much.

The intention is that these files are "downstream" from the build
system. As the build changes, it would adjust the config files to
tell Git what to do.

>> It's good that you've protected against infinite loops.
>>
>> Is there any reason to prefer swallowing infinite loops rather than
>> warning or flagging as an error?  (I'm not sure, just thinking my
>> questions out loud.)
> 
> The buildsystem folks also reminded me that we have cylic dependencies
> already, and although it's absolutely ugly, it is somewhat forced on
> us by a combination of different external tools that we can't change.
> As such, warnings or errors would be really annoying and we'd be one
> of the ones to want to turn them off.  So drop that idea from me.  :-)

This is more common than one might think!

Thanks for taking a look,
-Stolee
diff mbox series

Patch

diff --git a/sparse-checkout.c b/sparse-checkout.c
index 6c58fda9722..d01f4d7b525 100644
--- a/sparse-checkout.c
+++ b/sparse-checkout.c
@@ -9,6 +9,7 @@ 
 #include "string-list.h"
 #include "unpack-trees.h"
 #include "object-store.h"
+#include "oidset.h"
 
 char *get_sparse_checkout_filename(void)
 {
@@ -77,9 +78,12 @@  int load_in_tree_pattern_list(struct repository *r,
 			      struct string_list *sl,
 			      struct pattern_list *pl)
 {
+	int result = 0;
 	struct index_state *istate = r->index;
 	struct string_list_item *item;
 	struct strbuf path = STRBUF_INIT;
+	struct oidset set;
+	oidset_init(&set, 16);
 
 	pl->use_cone_patterns = 1;
 
@@ -96,24 +100,34 @@  int load_in_tree_pattern_list(struct repository *r,
 		 * Use -1 return to ensure populate_from_existing_patterns()
 		 * skips the sparse-checkout updates.
 		 */
-		if (pos < 0)
-			return -1;
+		if (pos < 0) {
+			result = -1;
+			goto cleanup;
+		}
 
 		oid = &istate->cache[pos]->oid;
+
+		if (oidset_contains(&set, oid))
+			continue;
+
+		oidset_insert(&set, oid);
+
 		type = oid_object_info(r, oid, NULL);
 
 		if (type != OBJ_BLOB) {
 			warning(_("expected a file at '%s'; not updating sparse-checkout"),
 				oid_to_hex(oid));
-			return 1;
+			result = 1;
+			goto cleanup;
 		}
 
 		load_in_tree_from_blob(pl, oid);
 	}
 
+cleanup:
 	strbuf_release(&path);
-
-	return 0;
+	oidset_clear(&set);
+	return result;
 }
 
 int populate_sparse_checkout_patterns(struct pattern_list *pl)