
[4/5] scalar: teach `diagnose` to gather loose objects information

Message ID 213f2c94b73f90fc758c2e3872804cf640cb2005.1643186507.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Series scalar: implement the subcommand "diagnose"

Commit Message

Matthew John Cheetham Jan. 26, 2022, 8:41 a.m. UTC
From: Matthew John Cheetham <mjcheetham@outlook.com>

When operating at the scale that Scalar wants to support, certain data
shapes are more likely to cause undesirable performance issues, such as
large numbers or large sizes of loose objects.

By including statistics about this, `scalar diagnose` now makes it
easier to identify such scenarios.

Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
---
 contrib/scalar/scalar.c          | 60 ++++++++++++++++++++++++++++++++
 contrib/scalar/t/t9099-scalar.sh |  2 ++
 2 files changed, 62 insertions(+)

Comments

Taylor Blau Jan. 26, 2022, 10:50 p.m. UTC | #1
On Wed, Jan 26, 2022 at 08:41:46AM +0000, Matthew John Cheetham via GitGitGadget wrote:
> +	while ((e = readdir(dir)) != NULL)
> +		if (!is_dot_or_dotdot(e->d_name) &&
> +		    e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
> +		    !hex_to_bytes(&c, e->d_name, 1)) {

What is this call to hex_to_bytes() for? I assume it's checking to make
sure the directory we're looking at is one of the shards of loose
objects.

Similar to my suggestion on the previous patch, I think that we could
get rid of this function entirely and replace it with a call to
for_each_loose_file_in_objdir().

We'll pay a little bit of extra cost to parse out each loose object's
OID, but it should be negligible since we're not actually opening up
each object.
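For illustration, a rough sketch of what that rewrite could look like
(not part of the patch; it assumes the callback signatures declared in
object-store.h, and the struct and helper names are made up):

	struct loose_objs_data {
		struct strbuf *buf;
		int count_in_shard;
		int total;
	};

	static int count_loose_object(const struct object_id *oid,
				      const char *path, void *data)
	{
		struct loose_objs_data *d = data;

		d->count_in_shard++;
		return 0; /* keep iterating */
	}

	static int flush_shard_count(unsigned int nr, const char *path,
				     void *data)
	{
		struct loose_objs_data *d = data;

		/* emit one line per existing shard, as the patch does */
		strbuf_addf(d->buf, "%02x : %7d files\n", nr, d->count_in_shard);
		d->total += d->count_in_shard;
		d->count_in_shard = 0;
		return 0;
	}

	static void loose_objs_stats(struct strbuf *buf, const char *path)
	{
		struct loose_objs_data d = { .buf = buf };

		strbuf_addstr(buf, "Object directory stats for ");
		strbuf_add_absolute_path(buf, path);
		strbuf_addstr(buf, ":\n");

		for_each_loose_file_in_objdir(path, count_loose_object, NULL,
					      flush_shard_count, &d);
		strbuf_addf(buf, "Total: %d loose objects", d.total);
	}

If non-object files sitting in the shards turn out to be interesting as
well, the cruft_cb parameter (left NULL here) could count those
separately.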

> diff --git a/contrib/scalar/t/t9099-scalar.sh b/contrib/scalar/t/t9099-scalar.sh
> index b1745851e31..f2ec156d819 100755
> --- a/contrib/scalar/t/t9099-scalar.sh
> +++ b/contrib/scalar/t/t9099-scalar.sh
> @@ -77,6 +77,8 @@ test_expect_success UNZIP 'scalar diagnose' '
>  	unzip -p "$zip_path" diagnostics.log >out &&
>  	test_file_not_empty out &&
>  	unzip -p "$zip_path" packs-local.txt >out &&
> +	test_file_not_empty out &&

A more comprehensive test (here, and in the earlier instances, too)
might be useful beyond just "does this file exist in the archive".

Constructing an example repository where the number of loose objects is
known ahead of time, and then finding that number in the output of
objects-local.txt might be worthwhile to give us some extra confidence
that this is working as intended.
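Something along these lines, perhaps (a rough sketch, not part of the
patch; the repository path and the way $zip_path is obtained are
placeholders that would follow whatever the enclosing 'scalar diagnose'
test already does):

	git repack -ad &&
	test_commit three-loose-objects &&
	scalar diagnose . >out &&
	# ... extract $zip_path from "out" as the existing test does ...
	unzip -p "$zip_path" objects-local.txt >out &&
	grep "Total: 3 loose objects" out

After repacking everything into a single pack, one test_commit leaves
exactly three loose objects (blob, tree, commit), so the summary line
becomes predictable.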

> +	unzip -p "$zip_path" objects-local.txt >out &&
>  	test_file_not_empty out
>  '

Thanks,
Taylor
Derrick Stolee Jan. 27, 2022, 3:17 p.m. UTC | #2
On 1/26/2022 5:50 PM, Taylor Blau wrote:
> On Wed, Jan 26, 2022 at 08:41:46AM +0000, Matthew John Cheetham via GitGitGadget wrote:
>> +	while ((e = readdir(dir)) != NULL)
>> +		if (!is_dot_or_dotdot(e->d_name) &&
>> +		    e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
>> +		    !hex_to_bytes(&c, e->d_name, 1)) {
> 
> What is this call to hex_to_bytes() for? I assume it's checking to make
> sure the directory we're looking at is one of the shards of loose
> objects.
> 
> Similar to my suggestion on the previous patch, I think that we could
> get rid of this function entirely and replace it with a call to
> for_each_loose_file_in_objdir().

There is a possibility that there are files other than loose objects
in these directories, so summarizing those counts might be helpful
information. For example: if somehow .git/objects/00/ was full of a
bunch of non-objects, it would still slow down Git commands that ask
for a short-sha starting with "00".

While this shouldn't be a normal case, the 'diagnose' command is
built to help us find these extremely odd scenarios because they
_have_ happened before (typically, it was a VFS for Git bug that
taught us how to look for these situations).
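If we do go the for_each_loose_file_in_objdir() route, its cruft_cb
parameter would be the natural place to keep track of such non-object
files, e.g. extending the earlier sketch (the field name is made up):

	static int count_cruft_file(const char *basename, const char *path,
				    void *data)
	{
		struct loose_objs_data *d = data;

		d->cruft_in_shard++; /* non-object file in an object shard */
		return 0;
	}

and passing it as the third argument instead of NULL.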

Thanks,
-Stolee
Elijah Newren Jan. 27, 2022, 6:59 p.m. UTC | #3
On Wed, Jan 26, 2022 at 3:37 PM Matthew John Cheetham via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Matthew John Cheetham <mjcheetham@outlook.com>
>
> When operating at the scale that Scalar wants to support, certain data
> shapes are more likely to cause undesirable performance issues, such as
> large numbers or large sizes of loose objects.

Makes sense.

> By including statistics about this, `scalar diagnose` now makes it
> easier to identify such scenarios.
>
> Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
> ---
>  contrib/scalar/scalar.c          | 60 ++++++++++++++++++++++++++++++++
>  contrib/scalar/t/t9099-scalar.sh |  2 ++
>  2 files changed, 62 insertions(+)
>
> diff --git a/contrib/scalar/scalar.c b/contrib/scalar/scalar.c
> index 690933ffdf3..c0ad4948215 100644
> --- a/contrib/scalar/scalar.c
> +++ b/contrib/scalar/scalar.c
> @@ -686,6 +686,60 @@ static void dir_file_stats(struct strbuf *buf, const char *path)
>         closedir(dir);
>  }
>
> +static int count_files(char *path)
> +{
> +       DIR *dir = opendir(path);
> +       struct dirent *e;
> +       int count = 0;
> +
> +       if (!dir)
> +               return 0;
> +
> +       while ((e = readdir(dir)) != NULL)
> +               if (!is_dot_or_dotdot(e->d_name) && e->d_type == DT_REG)
> +                       count++;
> +
> +       closedir(dir);
> +       return count;
> +}
> +
> +static void loose_objs_stats(struct strbuf *buf, const char *path)
> +{
> +       DIR *dir = opendir(path);
> +       struct dirent *e;
> +       int count;
> +       int total = 0;
> +       unsigned char c;
> +       struct strbuf count_path = STRBUF_INIT;
> +       size_t base_path_len;
> +
> +       if (!dir)
> +               return;
> +
> +       strbuf_addstr(buf, "Object directory stats for ");
> +       strbuf_add_absolute_path(buf, path);
> +       strbuf_addstr(buf, ":\n");
> +
> +       strbuf_add_absolute_path(&count_path, path);
> +       strbuf_addch(&count_path, '/');
> +       base_path_len = count_path.len;
> +
> +       while ((e = readdir(dir)) != NULL)
> +               if (!is_dot_or_dotdot(e->d_name) &&
> +                   e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
> +                   !hex_to_bytes(&c, e->d_name, 1)) {

You only recurse into directories, ignoring individual files.

> +                       strbuf_setlen(&count_path, base_path_len);
> +                       strbuf_addstr(&count_path, e->d_name);
> +                       total += (count = count_files(count_path.buf));
> +                       strbuf_addf(buf, "%s : %7d files\n", e->d_name, count);

This shows the number of files within a directory.

> +               }
> +
> +       strbuf_addf(buf, "Total: %d loose objects", total);

and this shows the total number of files across all the directories.

But the commit message suggested you also wanted to check for large
sizes of loose objects.  Did that get ripped out at some point with
the commit message not being updated, or is it perhaps going to be
included later?

> +
> +       strbuf_release(&count_path);
> +       closedir(dir);
> +}
> +
>  static int cmd_diagnose(int argc, const char **argv)
>  {
>         struct option options[] = {
> @@ -734,6 +788,12 @@ static int cmd_diagnose(int argc, const char **argv)
>         if ((res = stage(tmp_dir.buf, &buf, "packs-local.txt")))
>                 goto diagnose_cleanup;
>
> +       strbuf_reset(&buf);
> +       loose_objs_stats(&buf, ".git/objects");
> +
> +       if ((res = stage(tmp_dir.buf, &buf, "objects-local.txt")))
> +               goto diagnose_cleanup;
> +
>         if ((res = stage_directory(tmp_dir.buf, ".git", 0)) ||
>             (res = stage_directory(tmp_dir.buf, ".git/hooks", 0)) ||
>             (res = stage_directory(tmp_dir.buf, ".git/info", 0)) ||
> diff --git a/contrib/scalar/t/t9099-scalar.sh b/contrib/scalar/t/t9099-scalar.sh
> index b1745851e31..f2ec156d819 100755
> --- a/contrib/scalar/t/t9099-scalar.sh
> +++ b/contrib/scalar/t/t9099-scalar.sh
> @@ -77,6 +77,8 @@ test_expect_success UNZIP 'scalar diagnose' '
>         unzip -p "$zip_path" diagnostics.log >out &&
>         test_file_not_empty out &&
>         unzip -p "$zip_path" packs-local.txt >out &&
> +       test_file_not_empty out &&
> +       unzip -p "$zip_path" objects-local.txt >out &&
>         test_file_not_empty out
>  '
>
> --
> gitgitgadget
Johannes Schindelin Feb. 6, 2022, 9:25 p.m. UTC | #4
Hi Elijah,

On Thu, 27 Jan 2022, Elijah Newren wrote:

> On Wed, Jan 26, 2022 at 3:37 PM Matthew John Cheetham via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > From: Matthew John Cheetham <mjcheetham@outlook.com>
> >
> > When operating at the scale that Scalar wants to support, certain data
> > shapes are more likely to cause undesirable performance issues, such as
> > large numbers or large sizes of loose objects.
>
> Makes sense.
>
> > By including statistics about this, `scalar diagnose` now makes it
> > easier to identify such scenarios.
> >
> > Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com>
> > ---
> >  contrib/scalar/scalar.c          | 60 ++++++++++++++++++++++++++++++++
> >  contrib/scalar/t/t9099-scalar.sh |  2 ++
> >  2 files changed, 62 insertions(+)
> >
> > diff --git a/contrib/scalar/scalar.c b/contrib/scalar/scalar.c
> > index 690933ffdf3..c0ad4948215 100644
> > --- a/contrib/scalar/scalar.c
> > +++ b/contrib/scalar/scalar.c
> > @@ -686,6 +686,60 @@ static void dir_file_stats(struct strbuf *buf, const char *path)
> >         closedir(dir);
> >  }
> >
> > +static int count_files(char *path)
> > +{
> > +       DIR *dir = opendir(path);
> > +       struct dirent *e;
> > +       int count = 0;
> > +
> > +       if (!dir)
> > +               return 0;
> > +
> > +       while ((e = readdir(dir)) != NULL)
> > +               if (!is_dot_or_dotdot(e->d_name) && e->d_type == DT_REG)
> > +                       count++;
> > +
> > +       closedir(dir);
> > +       return count;
> > +}
> > +
> > +static void loose_objs_stats(struct strbuf *buf, const char *path)
> > +{
> > +       DIR *dir = opendir(path);
> > +       struct dirent *e;
> > +       int count;
> > +       int total = 0;
> > +       unsigned char c;
> > +       struct strbuf count_path = STRBUF_INIT;
> > +       size_t base_path_len;
> > +
> > +       if (!dir)
> > +               return;
> > +
> > +       strbuf_addstr(buf, "Object directory stats for ");
> > +       strbuf_add_absolute_path(buf, path);
> > +       strbuf_addstr(buf, ":\n");
> > +
> > +       strbuf_add_absolute_path(&count_path, path);
> > +       strbuf_addch(&count_path, '/');
> > +       base_path_len = count_path.len;
> > +
> > +       while ((e = readdir(dir)) != NULL)
> > +               if (!is_dot_or_dotdot(e->d_name) &&
> > +                   e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
> > +                   !hex_to_bytes(&c, e->d_name, 1)) {
>
> You only recurse into directories, ignoring individual files.
>
> > +                       strbuf_setlen(&count_path, base_path_len);
> > +                       strbuf_addstr(&count_path, e->d_name);
> > +                       total += (count = count_files(count_path.buf));
> > +                       strbuf_addf(buf, "%s : %7d files\n", e->d_name, count);
>
> This shows the number of files within a directory.
>
> > +               }
> > +
> > +       strbuf_addf(buf, "Total: %d loose objects", total);
>
> and this shows the total number of files across all the directories.
>
> But the commit message suggested you also wanted to check for large
> sizes of loose objects.  Did that get ripped out at some point with
> the commit message not being updated, or is it perhaps going to be
> included later?

No, there was no plan to include this information later, as the original
.NET implementation of `scalar diagnose` did not provide that information,
either (which I take as a strong sign that we never needed this type of
information to help users, at least not up until this point).

Besides, it would be kind of a difficult thing to say conclusively what
makes a loose file "big". Is it the zlib-compressed size on disk? Or the
unpacked size? Should there be a configurable threshold to determine when
an object is big? Should `core.bigFileThreshold` be co-opted for this?

Together with the fact that there was no need for this information in
practice, it makes me doubt that we should add this type of information. I
actually suspect that _iff_ information of that type would be helpful, a
more complete tool like git-sizer (https://github.com/github/git-sizer/)
would be needed, and I do not really want to subsume git-sizer's
functionality in `scalar diagnose`.

I rephrased the commit message.

Ciao,
Dscho

>
> > +
> > +       strbuf_release(&count_path);
> > +       closedir(dir);
> > +}
> > +
> >  static int cmd_diagnose(int argc, const char **argv)
> >  {
> >         struct option options[] = {
> > @@ -734,6 +788,12 @@ static int cmd_diagnose(int argc, const char **argv)
> >         if ((res = stage(tmp_dir.buf, &buf, "packs-local.txt")))
> >                 goto diagnose_cleanup;
> >
> > +       strbuf_reset(&buf);
> > +       loose_objs_stats(&buf, ".git/objects");
> > +
> > +       if ((res = stage(tmp_dir.buf, &buf, "objects-local.txt")))
> > +               goto diagnose_cleanup;
> > +
> >         if ((res = stage_directory(tmp_dir.buf, ".git", 0)) ||
> >             (res = stage_directory(tmp_dir.buf, ".git/hooks", 0)) ||
> >             (res = stage_directory(tmp_dir.buf, ".git/info", 0)) ||
> > diff --git a/contrib/scalar/t/t9099-scalar.sh b/contrib/scalar/t/t9099-scalar.sh
> > index b1745851e31..f2ec156d819 100755
> > --- a/contrib/scalar/t/t9099-scalar.sh
> > +++ b/contrib/scalar/t/t9099-scalar.sh
> > @@ -77,6 +77,8 @@ test_expect_success UNZIP 'scalar diagnose' '
> >         unzip -p "$zip_path" diagnostics.log >out &&
> >         test_file_not_empty out &&
> >         unzip -p "$zip_path" packs-local.txt >out &&
> > +       test_file_not_empty out &&
> > +       unzip -p "$zip_path" objects-local.txt >out &&
> >         test_file_not_empty out
> >  '
> >
> > --
> > gitgitgadget
>

Patch

diff --git a/contrib/scalar/scalar.c b/contrib/scalar/scalar.c
index 690933ffdf3..c0ad4948215 100644
--- a/contrib/scalar/scalar.c
+++ b/contrib/scalar/scalar.c
@@ -686,6 +686,60 @@  static void dir_file_stats(struct strbuf *buf, const char *path)
 	closedir(dir);
 }
 
+static int count_files(char *path)
+{
+	DIR *dir = opendir(path);
+	struct dirent *e;
+	int count = 0;
+
+	if (!dir)
+		return 0;
+
+	while ((e = readdir(dir)) != NULL)
+		if (!is_dot_or_dotdot(e->d_name) && e->d_type == DT_REG)
+			count++;
+
+	closedir(dir);
+	return count;
+}
+
+static void loose_objs_stats(struct strbuf *buf, const char *path)
+{
+	DIR *dir = opendir(path);
+	struct dirent *e;
+	int count;
+	int total = 0;
+	unsigned char c;
+	struct strbuf count_path = STRBUF_INIT;
+	size_t base_path_len;
+
+	if (!dir)
+		return;
+
+	strbuf_addstr(buf, "Object directory stats for ");
+	strbuf_add_absolute_path(buf, path);
+	strbuf_addstr(buf, ":\n");
+
+	strbuf_add_absolute_path(&count_path, path);
+	strbuf_addch(&count_path, '/');
+	base_path_len = count_path.len;
+
+	while ((e = readdir(dir)) != NULL)
+		if (!is_dot_or_dotdot(e->d_name) &&
+		    e->d_type == DT_DIR && strlen(e->d_name) == 2 &&
+		    !hex_to_bytes(&c, e->d_name, 1)) {
+			strbuf_setlen(&count_path, base_path_len);
+			strbuf_addstr(&count_path, e->d_name);
+			total += (count = count_files(count_path.buf));
+			strbuf_addf(buf, "%s : %7d files\n", e->d_name, count);
+		}
+
+	strbuf_addf(buf, "Total: %d loose objects", total);
+
+	strbuf_release(&count_path);
+	closedir(dir);
+}
+
 static int cmd_diagnose(int argc, const char **argv)
 {
 	struct option options[] = {
@@ -734,6 +788,12 @@  static int cmd_diagnose(int argc, const char **argv)
 	if ((res = stage(tmp_dir.buf, &buf, "packs-local.txt")))
 		goto diagnose_cleanup;
 
+	strbuf_reset(&buf);
+	loose_objs_stats(&buf, ".git/objects");
+
+	if ((res = stage(tmp_dir.buf, &buf, "objects-local.txt")))
+		goto diagnose_cleanup;
+
 	if ((res = stage_directory(tmp_dir.buf, ".git", 0)) ||
 	    (res = stage_directory(tmp_dir.buf, ".git/hooks", 0)) ||
 	    (res = stage_directory(tmp_dir.buf, ".git/info", 0)) ||
diff --git a/contrib/scalar/t/t9099-scalar.sh b/contrib/scalar/t/t9099-scalar.sh
index b1745851e31..f2ec156d819 100755
--- a/contrib/scalar/t/t9099-scalar.sh
+++ b/contrib/scalar/t/t9099-scalar.sh
@@ -77,6 +77,8 @@  test_expect_success UNZIP 'scalar diagnose' '
 	unzip -p "$zip_path" diagnostics.log >out &&
 	test_file_not_empty out &&
 	unzip -p "$zip_path" packs-local.txt >out &&
+	test_file_not_empty out &&
+	unzip -p "$zip_path" objects-local.txt >out &&
 	test_file_not_empty out
 '