
[RFC] index-pack: improve performance on NFS

Message ID ED25E182-C296-4D08-8170-340567D8964A@amazon.com (mailing list archive)
State New, archived

Commit Message

Geert Jansen Oct. 25, 2018, 6:38 p.m. UTC
The index-pack command determines if a sha1 collision test is needed by
checking the existence of a loose object with the given hash. In my tests, I
can improve performance of “git clone” on Amazon EFS by 8x when used with a
non-default mount option (lookupcache=pos) that's required for a Gitlab HA
setup. My assumption is that this check is unnecessary when cloning into a new
repository because the repository will be empty.

By default, the Linux NFS client will cache directory entries as well as the
non-existence of directory entries. The latter means that when client c1 does
stat() on a file that does not exist, the non-existence will be cached and any
subsequent stat() operation on the file will return -ENOENT until the cache
expires or is invalidated, even if the file was created on client c2 in the
meantime. This leads to errors in a Gitlab HA setup that distributes jobs
over multiple worker nodes on the assumption that each worker node has the
same view of the shared file system.

The recommended workaround by Gitlab is to use the “lookupcache=pos” NFS mount
option which disables the negative lookup cache. This option has a high
performance impact. Cloning the gitlab-ce repository
(https://gitlab.com/gitlab-org/gitlab-ce.git) into an NFS-mounted directory
gives the following results:

  lookupcache=all (default): 624 seconds
  lookupcache=pos: 4957 seconds

The reason for the poor performance is that index-pack will issue a stat()
call for every object in the repo when checking if a collision test is needed.
These stat() calls result in the following NFS operations:

  LOOKUP dirfh=".git/objects", name="01" -> NFS4ERR_NOENT

With lookupcache=all, the non-existence of the .git/objects/XX directories is
cached, so that there will be at most 256 LOOKUP calls. With lookupcache=pos,
there will be one LOOKUP operation for every object in the repository, which
for the gitlab-ce repo means about 1.3 million LOOKUP calls.

The attached patch removes the collision check when cloning into a new
repository. The performance of git clone with this patch is:

  lookupcache=pos (with patch): 577 seconds

I'd welcome feedback on the attached patch and whether my assumption that the
sha1 collision check can be safely omitted when cloning into a new repository
is correct.

Signed-off-by: Geert Jansen <gerardu@amazon.com>
---
 builtin/index-pack.c | 5 ++++-
 fetch-pack.c         | 2 ++
 2 files changed, 6 insertions(+), 1 deletion(-)

Comments

Junio C Hamano Oct. 26, 2018, 12:21 a.m. UTC | #1
"Jansen, Geert" <gerardu@amazon.com> writes:

> The index-pack command determines if a sha1 collision test is needed by
> checking the existence of a loose object with the given hash. In my tests, I
> can improve performance of “git clone” on Amazon EFS by 8x when used with a
> non-default mount option (lookupcache=pos) that's required for a Gitlab HA
> setup. My assumption is that this check is unnecessary when cloning into a new
> repository because the repository will be empty.

My knee-jerk reaction is that your insight that we can skip the "dup
check" when starting from emptiness is probably correct, but your
use of .cloning flag as an approximation of "are we starting from
emptiness?" is probably wrong, at least for two reasons.

 - "git clone --reference=..." does not strictly start from
   emptiness, and would still have to make sure that incoming pack
   does not try to inject an object with different contents but with
   the same name.

 - "git init && git fetch ..." starts from emptiness and would want
   to benefit from the same optimization as you are implementing
   here.

As to the implementation, I think the patch adjusts the right "if()"
condition to skip the collision test here.

> -	if (startup_info->have_repository) {
> +	if (startup_info->have_repository && !cloning) {
>  		read_lock();
>  		collision_test_needed =
>  			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);

I just do not think !cloning is quite correct.

Thanks.
Ævar Arnfjörð Bjarmason Oct. 26, 2018, 8:38 p.m. UTC | #2
On Fri, Oct 26 2018, Junio C Hamano wrote:

> "Jansen, Geert" <gerardu@amazon.com> writes:
>
>> The index-pack command determines if a sha1 collision test is needed by
>> checking the existence of a loose object with the given hash. In my tests, I
>> can improve performance of “git clone” on Amazon EFS by 8x when used with a
>> non-default mount option (lookupcache=pos) that's required for a Gitlab HA
>> setup. My assumption is that this check is unnecessary when cloning into a new
>> repository because the repository will be empty.
>
> My knee-jerk reaction is that your insight that we can skip the "dup
> check" when starting from emptiness is probably correct, but your
> use of .cloning flag as an approximation of "are we starting from
> emptiness?" is probably wrong, at least for two reasons.
>
>  - "git clone --reference=..." does not strictly start from
>    emptiness, and would still have to make sure that incoming pack
>    does not try to inject an object with different contents but with
>    the same name.
>
>  - "git init && git fetch ..." starts from emptiness and would want
>    to benefit from the same optimization as you are implementing
>    here.
>
> As to the implementation, I think the patch adjusts the right "if()"
> condition to skip the collision test here.
>
>> -	if (startup_info->have_repository) {
>> +	if (startup_info->have_repository && !cloning) {
>>  		read_lock();
>>  		collision_test_needed =
>>  			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
>
> I just do not think !cloning is quite correct.

Geert: Thanks for working on this. A GitLab instance I'm involved in
managing at work has suffered from this issue, e.g. with "fork" being a
"clone" under the hood on GitLab, and taking ages on the NetApp NFS
filer as a result, so I'm very interested in this moving forward.

But as Junio notes the devil's in the details, another one I thought of
is:

    GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...

It seems to me that it's better to stick this into
setup_git_directory_gently() in setup.c and check various edge cases
there. I.e. make this a new "have_objects_already" member of the
startup_info struct.

That would be set depending on whether we find objects/packs in the
objects dir, and would know about GIT_OBJECT_DIRECTORY (either just
punting, or looking at those too). It would also need to know about
read_info_alternates(); depending on when that's checked, it would handle
git clone --reference too, since --reference just sets things up via
add_to_alternates_file().

Then you'd change sha1_object() to just check
startup_info->have_objects_already instead of
startup_info->have_repository, and since we'd checked "did we have
objects already?" it would work for the init && fetch case Junio
mentioned.

It would also be very nice to have a test for this, even if it's
something OS-specific that only works on Linux after we probe for
strace(1).

Testing (without your patch, because git-am barfs on it; it seems to
dislike the base64 encoding):

    rm -rf /tmp/df
    strace -f git clone --bare git@github.com:git/git.git /tmp/df 2>&1 |
        grep '".*ENOENT' | perl -pe 's/^.*?"([^"]+)".*/$1/' |
        sort | uniq -c | sort -nr | less

I see we also check if packed-refs exists ~2800 times, and check for
each ref we find on the remote. Those are obviously less of a
performance issue when cloning in the common case, but perhaps another
place where we can insert a "don't check, we don't have anything
already" condition.
Junio C Hamano Oct. 27, 2018, 7:26 a.m. UTC | #3
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> But as Junio notes the devil's in the details, another one I thought of
> is:
>
>     GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...
>
> It seems to me that ...

Actually I take all of that back ;-)

For the purpose of this patch, use of existing .cloning field in the
transport is fine, as the sole existing user of the field wants the
field to mean "Are we starting with an empty object store?", and not
"Are we running the command whose name is 'git clone'?".

Now, the logic used to set the field to true may have room to be
improved.  We should do that as a separate and orthogonal effort so
that the two cases I mentioned plus the new one(s) you brought up
would also be taken into account, so that we can set the .cloning
field more accurately to suit the two callers' needs---one caller
is the age-old one added by beea4152 ("clone: support remote shallow
repository", 2013-12-05), and the other one is what Geert is adding
in this thread.

We _may_ want to rename that transport.cloning field to better
reflect what it truly means (it is not "are we cloning?" but "are
there any objects in the repo to worry about?") as a preparatory
step before Geert's patch, but I do not think we should make it a
requirement to take the "let's skip collision check" patch to
improve the logic used to set that .cloning field.
Jeff King Oct. 27, 2018, 9:33 a.m. UTC | #4
On Sat, Oct 27, 2018 at 04:26:50PM +0900, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
> > But as Junio notes the devil's in the details, another one I thought of
> > is:
> >
> >     GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...
> >
> > It seems to me that ...
> 
> Actually I take all of that back ;-)
> 
> For the purpose of this patch, use of existing .cloning field in the
> transport is fine, as the sole existing user of the field wants the
> field to mean "Are we starting with an empty object store?", and not
> "Are we running the command whose name is 'git clone'?".

Taking one step back, the root problem in this thread is that stat() on
non-existing files is slow (which makes has_sha1_file slow).

One solution there is to cache the results of looking in .git/objects
(or any alternate object store) for loose files. And indeed, this whole
scheme is just a specialized form of that: it's a flag to say "hey, we
do not have any objects yet, so do not bother looking".

Could we implement that in a more direct and central way? And could we
implement it in a way that catches more cases? E.g., if I have _one_
object, that defeats this specialized optimization, but it is probably
still beneficial to cache that knowledge (and the reasonable cutoff is
probably not 1, but some value of N loose objects).

Of course any cache raises questions of cache invalidation, but I think
we've already dealt with that for this case. When we use
OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
accuracy/speed tradeoff (which does a similar caching thing with
packfiles).

So putting that all together, could we have something like:

diff --git a/object-store.h b/object-store.h
index 63b7605a3e..28cde568a0 100644
--- a/object-store.h
+++ b/object-store.h
@@ -135,6 +135,18 @@ struct raw_object_store {
 	 */
 	struct packed_git *all_packs;
 
+	/*
+	 * A set of all loose objects we have. This probably ought to be split
+	 * into a set of 256 caches so that we can fault in one directory at a
+	 * time.
+	 */
+	struct oid_array loose_cache;
+	enum {
+		LOOSE_CACHE_UNFILLED = 0,
+		LOOSE_CACHE_INVALID,
+		LOOSE_CACHE_VALID
+	} loose_cache_status;
+
 	/*
 	 * A fast, rough count of the number of objects in the repository.
 	 * These two fields are not meant for direct access. Use
diff --git a/packfile.c b/packfile.c
index 86074a76e9..68ca4fff0e 100644
--- a/packfile.c
+++ b/packfile.c
@@ -990,6 +990,8 @@ void reprepare_packed_git(struct repository *r)
 	r->objects->approximate_object_count_valid = 0;
 	r->objects->packed_git_initialized = 0;
 	prepare_packed_git(r);
+	oid_array_clear(&r->objects->loose_cache);
+	r->objects->loose_cache_status = LOOSE_CACHE_UNFILLED;
 }
 
 struct packed_git *get_packed_git(struct repository *r)
diff --git a/sha1-file.c b/sha1-file.c
index dd0b6aa873..edbe037eaa 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -1172,6 +1172,40 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
 	return parse_sha1_header_extended(hdr, &oi, 0);
 }
 
+/* probably should be configurable? */
+#define LOOSE_OBJECT_CACHE_MAX 65536
+
+static int fill_loose_cache(const struct object_id *oid,
+			    const char *path,
+			    void *data)
+{
+	struct oid_array *cache = data;
+
+	if (cache->nr == LOOSE_OBJECT_CACHE_MAX)
+		return -1;
+
+	oid_array_append(data, oid);
+	return 0;
+}
+
+static int quick_has_loose(struct raw_object_store *r,
+			   struct object_id *oid)
+{
+	struct oid_array *cache = &r->loose_cache;
+
+	if (r->loose_cache_status == LOOSE_CACHE_UNFILLED) {
+		if (for_each_loose_object(fill_loose_cache, cache, 0) < 0)
+			r->loose_cache_status = LOOSE_CACHE_INVALID;
+		else
+			r->loose_cache_status = LOOSE_CACHE_VALID;
+	}
+
+	if (r->loose_cache_status == LOOSE_CACHE_INVALID)
+		return -1;
+
+	return oid_array_lookup(cache, oid) >= 0;
+}
+
 static int sha1_loose_object_info(struct repository *r,
 				  const unsigned char *sha1,
 				  struct object_info *oi, int flags)
@@ -1198,6 +1232,19 @@ static int sha1_loose_object_info(struct repository *r,
 	if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
 		const char *path;
 		struct stat st;
+		if (!oi->disk_sizep && (flags & OBJECT_INFO_QUICK)) {
+			struct object_id oid;
+			hashcpy(oid.hash, sha1);
+			switch (quick_has_loose(r->objects, &oid)) {
+			case 0:
+				return -1; /* missing: error */
+			case 1:
+				return 0; /* have: 0 == success */
+			default:
+				/* unknown; fall back to stat */
+				break;
+			}
+		}
 		if (stat_sha1_file(r, sha1, &st, &path) < 0)
 			return -1;
 		if (oi->disk_sizep)

That's mostly untested, but it might be enough to run some timing tests
with. I think if we want to pursue this, we'd want to address the bits I
mentioned in the comments, and look at unifying this with the loose
cache from cc817ca3ef (which if I had remembered we added, probably
would have saved some time writing the above ;) ).

-Peff
Ævar Arnfjörð Bjarmason Oct. 27, 2018, 11:22 a.m. UTC | #5
On Sat, Oct 27 2018, Jeff King wrote:

> On Sat, Oct 27, 2018 at 04:26:50PM +0900, Junio C Hamano wrote:
>
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>>
>> > But as Junio notes the devil's in the details, another one I thought of
>> > is:
>> >
>> >     GIT_OBJECT_DIRECTORY=/some/other/repository git clone ...
>> >
>> > It seems to me that ...
>>
>> Actually I take all of that back ;-)
>>
>> For the purpose of this patch, use of existing .cloning field in the
>> transport is fine, as the sole existing user of the field wants the
>> field to mean "Are we starting with an empty object store?", and not
>> "Are we running the command whose name is 'git clone'?".
>
> Taking one step back, the root problem in this thread is that stat() on
> non-existing files is slow (which makes has_sha1_file slow).
>
> One solution there is to cache the results of looking in .git/objects
> (or any alternate object store) for loose files. And indeed, this whole
> scheme is just a specialized form of that: it's a flag to say "hey, we
> do not have any objects yet, so do not bother looking".
>
> Could we implement that in a more direct and central way? And could we
> implement it in a way that catches more cases? E.g., if I have _one_
> object, that defeats this specialized optimization, but it is probably
> still beneficial to cache that knowledge (and the reasonable cutoff is
> probably not 1, but some value of N loose objects).
>
> Of course any cache raises questions of cache invalidation, but I think
> we've already dealt with that for this case. When we use
> OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> accuracy/speed tradeoff (which does a similar caching thing with
> packfiles).
>
> So putting that all together, could we have something like:
>
> diff --git a/object-store.h b/object-store.h
> index 63b7605a3e..28cde568a0 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -135,6 +135,18 @@ struct raw_object_store {
>  	 */
>  	struct packed_git *all_packs;
>
> +	/*
> +	 * A set of all loose objects we have. This probably ought to be split
> +	 * into a set of 256 caches so that we can fault in one directory at a
> +	 * time.
> +	 */
> +	struct oid_array loose_cache;
> +	enum {
> +		LOOSE_CACHE_UNFILLED = 0,
> +		LOOSE_CACHE_INVALID,
> +		LOOSE_CACHE_VALID
> +	} loose_cache_status;
> +
>  	/*
>  	 * A fast, rough count of the number of objects in the repository.
>  	 * These two fields are not meant for direct access. Use
> diff --git a/packfile.c b/packfile.c
> index 86074a76e9..68ca4fff0e 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -990,6 +990,8 @@ void reprepare_packed_git(struct repository *r)
>  	r->objects->approximate_object_count_valid = 0;
>  	r->objects->packed_git_initialized = 0;
>  	prepare_packed_git(r);
> +	oid_array_clear(&r->objects->loose_cache);
> +	r->objects->loose_cache_status = LOOSE_CACHE_UNFILLED;
>  }
>
>  struct packed_git *get_packed_git(struct repository *r)
> diff --git a/sha1-file.c b/sha1-file.c
> index dd0b6aa873..edbe037eaa 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -1172,6 +1172,40 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
>  	return parse_sha1_header_extended(hdr, &oi, 0);
>  }
>
> +/* probably should be configurable? */
> +#define LOOSE_OBJECT_CACHE_MAX 65536
> +
> +static int fill_loose_cache(const struct object_id *oid,
> +			    const char *path,
> +			    void *data)
> +{
> +	struct oid_array *cache = data;
> +
> +	if (cache->nr == LOOSE_OBJECT_CACHE_MAX)
> +		return -1;
> +
> +	oid_array_append(data, oid);
> +	return 0;
> +}
> +
> +static int quick_has_loose(struct raw_object_store *r,
> +			   struct object_id *oid)

The assumption with making it exactly 0 objects and not any value of >0
is that we can safely assume that a "clone" or initial "fetch"[1] is
special in ways that a clone isn't. I.e. we're starting out with nothing
and doing the initial population, that's probably not as true in an
existing repo that's getting concurrent fetches, commits, rebases etc.

But in the spirit of taking a step back, maybe we should take two steps
back and consider why we're doing this at all.

Three of our tests fail if we compile git like this, and cloning is much
faster (especially on NFS):

    diff --git a/builtin/index-pack.c b/builtin/index-pack.c
    index 2004e25da2..0c2d008ee0 100644
    --- a/builtin/index-pack.c
    +++ b/builtin/index-pack.c
    @@ -796,3 +796,3 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,

    -       if (startup_info->have_repository) {
    +       if (0) {
                    read_lock();

Even on a local disk I'm doing 262759 lstat() calls cloning git.git and
spending 5% of my time on that.

But why do we have this in the first place? It's because of 8685da4256
("don't ever allow SHA1 collisions to exist by fetching a pack",
2007-03-20) and your 51054177b3 ("index-pack: detect local corruption in
collision check", 2017-04-01).

I.e. we are worried about (and those tests check for):

 a) A malicious user trying to give us repository where they have
    created an object with the same SHA-1 that's different, as in the
    SHAttered attack.

    I remember (but have not dug up) an old E-Mail from Linus saying
    that this was an important security aspect of git, i.e. even if
    SHA-1 was broken you couldn't easily propagate bad objects.

 b) Cases where we've ended up with different content for a SHA-1 due to
    e.g. a local FS corruption. Which is the subject of your commit in
    2017.

 c) Are there cases where fetch.fsckObjects is off and we just flip a
    bit on the wire and don't notice? I think not because we always
    check the pack checksum (don't we), but I'm not 100% sure.

I'm inclined to think that we should at the very least make this
configurable. Running into a) is the least of my worries when operating
some git server on NFS.

The b) case is also a concern, but there we'd actually be improving
things by writing out the duplicate object. If that were followed up
with something like the "null" negotiator[2], and gc/repack were able
to look at the two objects, check their SHA-1/content and throw away
the bad one, we'd have the ability to heal a corrupt repository where
we now just produce a hard error.

Even if someone wants to make the argument that this is behavior that we
absolutely *MUST* keep and not make configurable, there's still much
smarter ways to do it.

We could e.g. just unconditionally write out the packfile into a
quarantine environment (see 720dae5a19 ("config doc: elaborate on
fetch.fsckObjects security", 2018-07-27)), *then* loop over the loose
objects and packs we have and see if any of those exist in the new pack,
if they do, do the current assertion, and if not (and fetch.fsckObjects
passes) move it out of the quarantine.

I'm most inclined to say we should just have a config option to disable
this in lieu of fancier solutions. I think a) is entirely implausible
(and I'm not worrying about state-level actors attacking my git repos),
and b) would be no worse than it is today.

1. Although less so for initial fetch; think a) set up a bunch of remotes,
   b) parallel 'git fetch {}' ::: $(git remote)

2. https://public-inbox.org/git/87o9ciisg6.fsf@evledraar.gmail.com/
Duy Nguyen Oct. 27, 2018, 2:04 p.m. UTC | #6
On Sat, Oct 27, 2018 at 11:34 AM Jeff King <peff@peff.net> wrote:
> Taking one step back, the root problem in this thread is that stat() on
> non-existing files is slow (which makes has_sha1_file slow).
>
> One solution there is to cache the results of looking in .git/objects
> (or any alternate object store) for loose files. And indeed, this whole
> scheme is just a specialized form of that: it's a flag to say "hey, we
> do not have any objects yet, so do not bother looking".
>
> Could we implement that in a more direct and central way? And could we
> implement it in a way that catches more cases? E.g., if I have _one_
> object, that defeats this specialized optimization, but it is probably
> still beneficial to cache that knowledge (and the reasonable cutoff is
> probably not 1, but some value of N loose objects).

And we do hit this on normal git-fetch. The larger the received pack
is (e.g. if you don't fetch often), the more stat() we do, so your
approach is definitely better.

> Of course any cache raises questions of cache invalidation, but I think
> we've already dealt with that for this case. When we use
> OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> accuracy/speed tradeoff (which does a similar caching thing with
> packfiles).

We don't care about a separate process adding more loose objects while
index-pack is running, do we? I'm guessing we don't but just to double
check...

> So putting that all together, could we have something like:
>
> diff --git a/object-store.h b/object-store.h
> index 63b7605a3e..28cde568a0 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -135,6 +135,18 @@ struct raw_object_store {
>          */
>         struct packed_git *all_packs;
>
> +       /*
> +        * A set of all loose objects we have. This probably ought to be split
> +        * into a set of 256 caches so that we can fault in one directory at a
> +        * time.
> +        */
> +       struct oid_array loose_cache;
> +       enum {
> +               LOOSE_CACHE_UNFILLED = 0,
> +               LOOSE_CACHE_INVALID,
> +               LOOSE_CACHE_VALID
> +       } loose_cache_status;
> +
>         /*
>          * A fast, rough count of the number of objects in the repository.
>          * These two fields are not meant for direct access. Use
> diff --git a/packfile.c b/packfile.c
> index 86074a76e9..68ca4fff0e 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -990,6 +990,8 @@ void reprepare_packed_git(struct repository *r)
>         r->objects->approximate_object_count_valid = 0;
>         r->objects->packed_git_initialized = 0;
>         prepare_packed_git(r);
> +       oid_array_clear(&r->objects->loose_cache);
> +       r->objects->loose_cache_status = LOOSE_CACHE_UNFILLED;
>  }
>
>  struct packed_git *get_packed_git(struct repository *r)
> diff --git a/sha1-file.c b/sha1-file.c
> index dd0b6aa873..edbe037eaa 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -1172,6 +1172,40 @@ int parse_sha1_header(const char *hdr, unsigned long *sizep)
>         return parse_sha1_header_extended(hdr, &oi, 0);
>  }
>
> +/* probably should be configurable? */

Yes, perhaps with gc.auto config value (multiplied by 256) as the cut
point. If it's too big maybe just go with a bloom filter. For this
particular case we expect like 99% of calls to miss.

> +#define LOOSE_OBJECT_CACHE_MAX 65536
> +
> +static int fill_loose_cache(const struct object_id *oid,
> +                           const char *path,
> +                           void *data)
> +{
> +       struct oid_array *cache = data;
> +
> +       if (cache->nr == LOOSE_OBJECT_CACHE_MAX)
> +               return -1;
> +
> +       oid_array_append(data, oid);
> +       return 0;
> +}
> +
> +static int quick_has_loose(struct raw_object_store *r,
> +                          struct object_id *oid)
> +{
> +       struct oid_array *cache = &r->loose_cache;
> +
> +       if (r->loose_cache_status == LOOSE_CACHE_UNFILLED) {
> +               if (for_each_loose_object(fill_loose_cache, cache, 0) < 0)
> +                       r->loose_cache_status = LOOSE_CACHE_INVALID;
> +               else
> +                       r->loose_cache_status = LOOSE_CACHE_VALID;
> +       }
> +
> +       if (r->loose_cache_status == LOOSE_CACHE_INVALID)
> +               return -1;
> +
> +       return oid_array_lookup(cache, oid) >= 0;
> +}
> +
>  static int sha1_loose_object_info(struct repository *r,
>                                   const unsigned char *sha1,
>                                   struct object_info *oi, int flags)
> @@ -1198,6 +1232,19 @@ static int sha1_loose_object_info(struct repository *r,
>         if (!oi->typep && !oi->type_name && !oi->sizep && !oi->contentp) {
>                 const char *path;
>                 struct stat st;
> +               if (!oi->disk_sizep && (flags & OBJECT_INFO_QUICK)) {
> +                       struct object_id oid;
> +                       hashcpy(oid.hash, sha1);
> +                       switch (quick_has_loose(r->objects, &oid)) {
> +                       case 0:
> +                               return -1; /* missing: error */
> +                       case 1:
> +                               return 0; /* have: 0 == success */
> +                       default:
> +                               /* unknown; fall back to stat */
> +                               break;
> +                       }
> +               }
>                 if (stat_sha1_file(r, sha1, &st, &path) < 0)
>                         return -1;
>                 if (oi->disk_sizep)
>
> That's mostly untested, but it might be enough to run some timing tests
> with. I think if we want to pursue this, we'd want to address the bits I
> mentioned in the comments, and look at unifying this with the loose
> cache from cc817ca3ef (which if I had remembered we added, probably
> would have saved some time writing the above ;) ).
>
> -Peff
Junio C Hamano Oct. 29, 2018, 12:48 a.m. UTC | #7
Jeff King <peff@peff.net> writes:

> Of course any cache raises questions of cache invalidation, but I think
> we've already dealt with that for this case. When we use
> OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> accuracy/speed tradeoff (which does a similar caching thing with
> packfiles).
>
> So putting that all together, could we have something like:

I think this conceptually is a vast improvement relative to
".cloning" optimization.  Obviously this does not have the huge
downside of the other approach that turns the collision detection
completely off.

A real question is how much performance gain, relative to ".cloning"
thing, this approach gives us.  If it gives us 80% or more of the
gain compared to doing no checking, I'd say we have a clear winner.

> That's mostly untested, but it might be enough to run some timing tests
> with. I think if we want to pursue this, we'd want to address the bits I
> mentioned in the comments, and look at unifying this with the loose
> cache from cc817ca3ef (which if I had remembered we added, probably
> would have saved some time writing the above ;) ).

Yup.
Jeff King Oct. 29, 2018, 3:04 p.m. UTC | #8
On Sat, Oct 27, 2018 at 01:22:16PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > Taking one step back, the root problem in this thread is that stat() on
> > non-existing files is slow (which makes has_sha1_file slow).
> >
> > One solution there is to cache the results of looking in .git/objects
> > (or any alternate object store) for loose files. And indeed, this whole
> > scheme is just a specialized form of that: it's a flag to say "hey, we
> > do not have any objects yet, so do not bother looking".
> >
> > Could we implement that in a more direct and central way? And could we
> > implement it in a way that catches more cases? E.g., if I have _one_
> > object, that defeats this specialized optimization, but it is probably
> > still beneficial to cache that knowledge (and the reasonable cutoff is
> > probably not 1, but some value of N loose objects).
> [...]
> 
> The assumption with making it exactly 0 objects and not any value of >0
> is that we can safely assume that a "clone" or initial "fetch"[1] is
> special in ways that a clone isn't. I.e. we're starting out with nothing
> and doing the initial population, that's probably not as true in an
> existing repo that's getting concurrent fetches, commits, rebases etc.

I assume you mean s/that a clone isn't/that a fetch isn't/.

I agree there are cases where you might be able to go further if you
assume a full "0". But my point is that "clone" is an ambiguous concept,
and it doesn't map completely to what's actually slow here. So if you
only look at "are we cloning", then:

  - you have a bunch of cases which are "clones", but aren't actually
    starting from scratch

  - you get zero benefit in the non-clone cases, when we could be
    scaling the benefit smoothly

> But in the spirit of taking a step back, maybe we should take two steps
> back and consider why we're doing this at all.

OK, I think it's worth discussing, and I'll do that below. But first I
want to say...

> Three of our tests fail if we compile git like this, and cloning is much
> faster (especially on NFS):
> 
>     diff --git a/builtin/index-pack.c b/builtin/index-pack.c
>     index 2004e25da2..0c2d008ee0 100644
>     --- a/builtin/index-pack.c
>     +++ b/builtin/index-pack.c
>     @@ -796,3 +796,3 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
> 
>     -       if (startup_info->have_repository) {
>     +       if (0) {
>                     read_lock();
> 
> Even on a local disk I'm doing 262759 lstat() calls cloning git.git and
> spending 5% of my time on that.

With the caching patch I posted earlier, I see roughly the same speedup
on an index-pack of git.git as I do with disabling the collision check
entirely (I did see about a 1% difference in favor of what you wrote
above, which was within the noise, but may well be valid due to slightly
reduced lock contention).

TBH I'm not sure if any of this is actually worth caring about on a
normal Linux system, though. There stat() is fast. It might be much more
interesting on macOS or Windows, or on a Linux system on NFS.

> But why do we have this in the first place? It's because of 8685da4256
> ("don't ever allow SHA1 collisions to exist by fetching a pack",
> 2007-03-20) and your 51054177b3 ("index-pack: detect local corruption in
> collision check", 2017-04-01).
> 
> I.e. we are worried about (and those tests check for):
> 
>  a) A malicious user trying to give us repository where they have
>     created an object with the same SHA-1 that's different, as in the
>     SHAttered attack.
> 
>     I remember (but have not dug up) an old E-Mail from Linus saying
>     that this was an important security aspect of git, i.e. even if
>     SHA-1 was broken you couldn't easily propagate bad objects.

Yeah, especially given recent advances in SHA-1 attacks, I'm not super
comfortable with the idea of disabling the duplicate-object check at
this point.

>  b) Cases where we've ended up with different content for a SHA-1 due to
>     e.g. a local FS corruption. Which is the subject of your commit in
>     2017.

Sort of. We actually detected it before my patch, but we just gave a
really crappy error message. ;)

>  c) Are there cases where fetch.fsckObjects is off and we just flip a
>     bit on the wire and don't notice? I think not because we always
>     check the pack checksum (don't we), but I'm not 100% sure.

We'd detect bit-flips on the wire due to the pack checksum. But what's
more interesting are bit-flips on the disk of the sender, which would
then put the bogus data into the pack checksum they generate on the fly.

However, we do detect such a bit-flip, even without fsckObjects, because
the sender does not tell us the expected sha-1 of each object. It gives
us a stream of objects, and the receiver computes the sha-1's
themselves. So a bit flip manifests in the connectivity-check when we
say "hey, the other side should have sent us object X but did not" (we
do not say "gee, what is this object Y they sent?" because after not
seeing X, we do not know which objects would have been reachable, so we
have a whole bunch of such Y's).

fetch.fsckObjects is purely about doing semantic object-quality checks.
They're not even that expensive to do. The main reason they're disabled
is that there are many historical objects floating around that fail
them (I think it would be a useful exercise to sort the existing checks
by priority, downgrading many of them to warnings, and then setting the
default for fetch.fsckObjects to "reject anything above warning").

> Even if someone wants to make the argument that this is behavior that we
> absolutely *MUST* keep and not make configurable, there's still much
> smarter ways to do it.

I don't have any real objection to a configuration like this, if people
want to experiment with it. But in contrast, the patch I showed earlier:

  - is safe enough to just turn on all the time, without the user having
    to configure anything nor make a safety tradeoff

  - speeds up all the other spots that use OBJECT_INFO_QUICK (like
    fetch's tag-following, or what appears to be the exact same
    optimization done manually inside mark_complete_and_common_ref()).

> We could e.g. just unconditionally write out the packfile into a
> quarantine environment (see 720dae5a19 ("config doc: elaborate on
> fetch.fsckObjects security", 2018-07-27)), *then* loop over the loose
> objects and packs we have and see if any of those exist in the new pack,
> if they do, do the current assertion, and if not (and fetch.fsckObjects
> passes) move it out of the quarantine.

Yes, I agree that would work, though it's a bigger architecture change.

-Peff
Jeff King Oct. 29, 2018, 3:09 p.m. UTC | #9
On Mon, Oct 29, 2018 at 11:04:53AM -0400, Jeff King wrote:

> > Even if someone wants to make the argument that this is behavior that we
> > absolutely *MUST* keep and not make configurable, there's still much
> > smarter ways to do it.
> 
> I don't have any real objection to a configuration like this, if people
> want to experiment with it. But in contrast, the patch I showed earlier:
> 
>   - is safe enough to just turn on all the time, without the user having
>     to configure anything nor make a safety tradeoff
> 
>   - speeds up all the other spots that use OBJECT_INFO_QUICK (like
>     fetch's tag-following, or what appears to be the exact same
>     optimization done manually inside mark_complete_and_common_ref()).

One thing I forgot to add. We're focusing here on the case where the
objects _aren't_ present, and we're primarily trying to get rid of the
stat call.

But when we actually do see a duplicate, we open up the object and
actually compare the bytes. Eliminating the collision check entirely
would save that work, which is obviously something that can't be
improved by just caching the existence of loose objects.

I'm not sure how often that case happens in a normal repository. We see
it a fair amount on GitHub servers because of the way we use alternates (i.e.,
we often already have the object you pushed up because it's present in
another fork and available via objects/info/alternates).

-Peff
Jeff King Oct. 29, 2018, 3:18 p.m. UTC | #10
On Sat, Oct 27, 2018 at 04:04:32PM +0200, Duy Nguyen wrote:

> > Of course any cache raises questions of cache invalidation, but I think
> > we've already dealt with that for this case. When we use
> > OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> > accuracy/speed tradeoff (which does a similar caching thing with
> > packfiles).
> 
> We don't care about a separate process adding more loose objects while
> index-pack is running, do we? I'm guessing we don't but just to double
> check...

Right. That's basically what QUICK means: don't bother re-examining the
repository to handle simultaneous writes, even if it means saying an
object is not there when it has recently appeared.

So far it has only applied to packs, but this is really just the same
concept (just as we would not notice a new pack arriving, we will not
notice a new loose object arriving).

> > +/* probably should be configurable? */
> > +#define LOOSE_OBJECT_CACHE_MAX 65536
> 
> Yes, perhaps with gc.auto config value (multiplied by 256) as the cut
> point. If it's too big maybe just go with a bloom filter. For this
> particular case we expect like 99% of calls to miss.

I wonder, though, if we should have a maximum at all. The existing
examples I've found of this technique are:

  - mark_complete_and_common_ref(), which is trying to cover this exact
    case. It looks like it avoids adding more objects than there are
    refs, so I guess it actually has a pretty small cutoff.

  - find_short_object_filename(), which does the same thing with no
    limits. And there if we _do_ have a lot of objects, we'd still
    prefer to keep the cache.

And really, this list is pretty much equivalent to looking at a pack
.idx. The only difference is that one is mmap'd, but here we'd use the
heap. So it's not shared between processes, but otherwise the working
set size is similar.

-Peff
Jeff King Oct. 29, 2018, 3:20 p.m. UTC | #11
On Mon, Oct 29, 2018 at 09:48:02AM +0900, Junio C Hamano wrote:

> > Of course any cache raises questions of cache invalidation, but I think
> > we've already dealt with that for this case. When we use
> > OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> > accuracy/speed tradeoff (which does a similar caching thing with
> > packfiles).
> >
> > So putting that all together, could we have something like:
> 
> I think this conceptually is a vast improvement relative to
> ".cloning" optimization.  Obviously this does not have the huge
> downside of the other approach that turns the collision detection
> completely off.
> 
> A real question is how much performance gain, relative to ".cloning"
> thing, this approach gives us.  If it gives us 80% or more of the
> gain compared to doing no checking, I'd say we have a clear winner.

My test runs showed it improving index-pack by about 3%, versus 4% for
no collision checking at all. But there was easily 1% of noise. And much
more importantly, that was on a Linux system on ext4, where stat is
fast. I'd be much more curious to hear timing results from people on
macOS or Windows, or from Geert's original NFS case.

-Peff
Ævar Arnfjörð Bjarmason Oct. 29, 2018, 6:43 p.m. UTC | #12
On Mon, Oct 29 2018, Jeff King wrote:

> On Mon, Oct 29, 2018 at 09:48:02AM +0900, Junio C Hamano wrote:
>
>> > Of course any cache raises questions of cache invalidation, but I think
>> > we've already dealt with that for this case. When we use
>> > OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
>> > accuracy/speed tradeoff (which does a similar caching thing with
>> > packfiles).
>> >
>> > So putting that all together, could we have something like:
>>
>> I think this conceptually is a vast improvement relative to
>> ".cloning" optimization.  Obviously this does not have the huge
>> downside of the other approach that turns the collision detection
>> completely off.
>>
>> A real question is how much performance gain, relative to ".cloning"
>> thing, this approach gives us.  If it gives us 80% or more of the
>> gain compared to doing no checking, I'd say we have a clear winner.
>
> My test runs showed it improving index-pack by about 3%, versus 4% for
> no collision checking at all. But there was easily 1% of noise. And much
> more importantly, that was on a Linux system on ext4, where stat is
> fast. I'd be much more curious to hear timing results from people on
> macOS or Windows, or from Geert's original NFS case.

At work we make copious use of NetApp filers over NFS. I'd say this
is probably typical for enterprise environments. Raw I/O performance
over the wire (writing a large file) is really good, but metadata
(e.g. stat) performance tends to be atrocious.

We both host the in-house Git server (GitLab) on such a filer (for HA
etc.), as well as many types of clients.

As noted by Geert upthread you need to mount the git directories with
lookupcache=positive (see e.g. [1]).

Cloning git.git as --bare onto such a partition with my patch:

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     60.98    1.802091          19     93896     19813 futex
     14.64    0.432782           7     61415        16 read
      9.40    0.277804           1    199576           pread64
      4.88    0.144172           3     49355        11 write
      3.10    0.091498          31      2919      2880 stat
      2.53    0.074812          31      2431       737 lstat
      1.96    0.057934           3     17257      1276 recvfrom
      0.91    0.026815           3      8543           select
      0.62    0.018425           2      8543           poll
    [...]
    real    0m32.053s
    user    0m21.451s
    sys     0m7.806s

Without:

    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     71.01   31.653787          50    628265     21608 futex
     24.14   10.761950          41    260658    258964 lstat
      2.22    0.988001           5    199576           pread64
      1.32    0.587844          10     59662         3 read
      0.79    0.350625           7     50376        11 write
      0.22    0.096019          33      2919      2880 stat
      0.13    0.057950           4     15821        12 recvfrom
      0.05    0.022385           3      7949           select
      0.04    0.015988           2      7949           poll
      0.03    0.013622        3406         4           wait4
    [...]
    real    4m38.670s
    user    0m29.015s
    sys     0m33.894s

So a reduction in clone time by ~90%.

Performance would be basically the same with your patch. But let's
discuss that elsewhere in this thread. Just wanted to post the
performance numbers here.

1. https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/109#note_12528896
Ævar Arnfjörð Bjarmason Oct. 29, 2018, 7:36 p.m. UTC | #13
On Mon, Oct 29 2018, Jeff King wrote:

> On Sat, Oct 27, 2018 at 01:22:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > Taking one step back, the root problem in this thread is that stat() on
>> > non-existing files is slow (which makes has_sha1_file slow).
>> >
>> > One solution there is to cache the results of looking in .git/objects
>> > (or any alternate object store) for loose files. And indeed, this whole
>> > scheme is just a specialized form of that: it's a flag to say "hey, we
>> > do not have any objects yet, so do not bother looking".
>> >
>> > Could we implement that in a more direct and central way? And could we
>> > implement it in a way that catches more cases? E.g., if I have _one_
>> > object, that defeats this specialized optimization, but it is probably
>> > still beneficial to cache that knowledge (and the reasonable cutoff is
>> > probably not 1, but some value of N loose objects).
>> [...]
>>
>> The assumption with making it exactly 0 objects and not any value of >0
>> is that we can safely assume that a "clone" or initial "fetch"[1] is
>> special in ways that a clone isn't. I.e. we're starting out with nothing
>> and doing the initial population, that's probably not as true in an
>> existing repo that's getting concurrent fetches, commits, rebases etc.
>
> I assume you mean s/that a clone isn't/that a fetch isn't/.

Yes, sorry.

> I agree there are cases where you might be able to go further if you
> assume a full "0". But my point is that "clone" is an ambiguous concept,
> and it doesn't map completely to what's actually slow here. So if you
> only look at "are we cloning", then:
>
>   - you have a bunch of cases which are "clones", but aren't actually
>     starting from scratch
>
>   - you get zero benefit in the non-clone cases, when we could be
>     scaling the benefit smoothly

Indeed. It's not special in principle, but I think in practice the
biggest wins are in the clone case, and unlike "fetch" we can safely
assume we're free of race conditions. More on that below.

>> But in the spirit of taking a step back, maybe we should take two steps
>> back and consider why we're doing this at all.
>
> OK, I think it's worth discussing, and I'll do that below. But first I
> want to say...
>
>> Three of our tests fail if we compile git like this, and cloning is much
>> faster (especially on NFS):
>>
>>     diff --git a/builtin/index-pack.c b/builtin/index-pack.c
>>     index 2004e25da2..0c2d008ee0 100644
>>     --- a/builtin/index-pack.c
>>     +++ b/builtin/index-pack.c
>>     @@ -796,3 +796,3 @@ static void sha1_object(const void *data, struct object_entry *obj_entry,
>>
>>     -       if (startup_info->have_repository) {
>>     +       if (0) {
>>                     read_lock();
>>
>> Even on a local disk I'm doing 262759 lstat() calls cloning git.git and
>> spending 5% of my time on that.
>
> With the caching patch I posted earlier, I see roughly the same speedup
> on an index-pack of git.git as I do with disabling the collision check
> entirely (I did see about a 1% difference in favor of what you wrote
> above, which was within the noise, but may well be valid due to slightly
> reduced lock contention).
>
> TBH I'm not sure if any of this is actually worth caring about on a
> normal Linux system, though. There stat() is fast. It might be much more
> interesting on macOS or Windows, or on a Linux system on NFS.

It matters a *lot* on NFS as my performance numbers in
https://public-inbox.org/git/87d0rslhkl.fsf@evledraar.gmail.com/ show.

>> But why do we have this in the first place? It's because of 8685da4256
>> ("don't ever allow SHA1 collisions to exist by fetching a pack",
>> 2007-03-20) and your 51054177b3 ("index-pack: detect local corruption in
>> collision check", 2017-04-01).
>>
>> I.e. we are worried about (and those tests check for):
>>
>>  a) A malicious user trying to give us repository where they have
>>     created an object with the same SHA-1 that's different, as in the
>>     SHAttered attack.
>>
>>     I remember (but have not dug up) an old E-Mail from Linus saying
>>     that this was an important security aspect of git, i.e. even if
>>     SHA-1 was broken you couldn't easily propagate bad objects.
>
> Yeah, especially given recent advances in SHA-1 attacks, I'm not super
> comfortable with the idea of disabling the duplicate-object check at
> this point.

I'd be comfortable with it in my setup since it's been limited to
collision attacks that are computationally prohibitive, and there being
no sign of preimage attacks, which is the case we really need to worry
about.

>>  b) Cases where we've ended up with different content for a SHA-1 due to
>>     e.g. a local FS corruption. Which is the subject of your commit in
>>     2017.
>
> Sort of. We actually detected it before my patch, but we just gave a
> really crappy error message. ;)
>
>>  c) Are there cases where fetch.fsckObjects is off and we just flip a
>>     bit on the wire and don't notice? I think not because we always
>>     check the pack checksum (don't we), but I'm not 100% sure.
>
> We'd detect bit-flips on the wire due to the pack checksum. But what's
> more interesting are bit-flips on the disk of the sender, which would
> then put the bogus data into the pack checksum they generate on the fly.
>
> However, we do detect such a bit-flip, even without fsckObjects, because
> the sender does not tell us the expected sha-1 of each object. It gives
> us a stream of objects, and the receiver computes the sha-1's
> themselves. So a bit flip manifests in the connectivity-check when we
> say "hey, the other side should have sent us object X but did not" (we
> do not say "gee, what is this object Y they sent?" because after not
> seeing X, we do not know which objects would have been reachable, so we
> have a whole bunch of such Y's).
>
> fetch.fsckObjects is purely about doing semantic object-quality checks.
> They're not even that expensive to do. The main reason they're disabled
> is that there are many historical objects floating around that fail
> them (I think it would be a useful exercise to sort the existing checks
> by priority, downgrading many of them to warnings, and then setting the
> default for fetch.fsckObjects to "reject anything above warning").

Thanks, that's really informative & useful.

>> Even if someone wants to make the argument that this is behavior that we
>> absolutely *MUST* keep and not make configurable, there's still much
>> smarter ways to do it.
>
> I don't have any real objection to a configuration like this, if people
> want to experiment with it. But in contrast, the patch I showed earlier:
>
>   - is safe enough to just turn on all the time, without the user having
>     to configure anything nor make a safety tradeoff

I think it's a useful patch to carry forward, and agree that it should
be turned on by default.

It does introduce a race condition where you can introduce a colliding
object to the repository by doing two concurrent pushes, but as you note
in
https://public-inbox.org/git/20181029151842.GJ17668@sigill.intra.peff.net/
this already applies to packs, so you can trigger that with the right
sized push (depending on transfer.unpackLimit), and we also have this in
existing forms for other stuff.

I do think it's amazingly paranoid to be worried about SHA-1 collisions
in the first place, and a bit odd to leave the door open on these race
conditions. I.e. it's hard to imagine a state-level[1] actor with
sufficient motivation to exploit this who wouldn't find some way to make
the race condition work as an escape hatch.

I admit just leaving that race condition does close a lot of doors
entirely. I.e. you could sometimes trigger a collision but wouldn't have
the right conditions to exploit the race condition.

>   - speeds up all the other spots that use OBJECT_INFO_QUICK (like
>     fetch's tag-following, or what appears to be the exact same
>     optimization done manually inside mark_complete_and_common_ref()).

We also pay a constant cost of doing an opendir() / readdir() on however
many loose objects we have for every push on the server-side. While it's
not as bad as stat() in a loop, that's also quite slow on NFS.

In a busy repo that gets a lot of branches / branch deletions (so not
quite as extreme as [2], but close) and the default expiry policy you
can easily have 20-100K loose objects (something near the lower bound of
that is the current live state of one server I'm looking at).

A recursive opendir()/readdir() on that on local disk is really fast if
it's in cache, but can easily be 1-5 seconds on NFS. So for a push we'd
now pay up to 5s just populating a cache we'll barely use to accept some
tiny push with just a few objects.

I also found when writing "index-pack: add ability to disable SHA-1
collision check" that it's really handy to recover from some forms of
repo corruption, so I've documented that. So aside from the performance
case it's a useful knob to have.

So what I'll do is:

 * Re-roll my 4 patch series to include the patch you have in
   <20181027093300.GA23974@sigill.intra.peff.net>

 * Turn that behavior on by default, but have some knob to toggle it
   off, because as noted above in some performance sensitive NFS cases
   I'd really like to not have the cache *AND* not have the collision
   check; performance will suffer with the cache.

>> We could e.g. just unconditionally write out the packfile into a
>> quarantine environment (see 720dae5a19 ("config doc: elaborate on
>> fetch.fsckObjects security", 2018-07-27)), *then* loop over the loose
>> objects and packs we have and see if any of those exist in the new pack,
>> if they do, do the current assertion, and if not (and fetch.fsckObjects
>> passes) move it out of the quarantine.
>
> Yes, I agree that would work, though it's a bigger architecture change.

1. "state-level" because even though Google's collision cost ~$100k
   we're talking about a *much* harder problem in practice of doing
   something useful. E.g. replacing git.c with an actual exploit, not
   just two cute specially crafted PDFs.

2. https://public-inbox.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
Geert Jansen Oct. 29, 2018, 9:34 p.m. UTC | #14
On Mon, Oct 29, 2018 at 09:48:02AM +0900, Junio C Hamano wrote:

> A real question is how much performance gain, relative to ".cloning"
> thing, this approach gives us.  If it gives us 80% or more of the
> gain compared to doing no checking, I'd say we have a clear winner.

I've tested Jeff's loose-object-cache patch and the performance is within error
bounds of my .cloning patch. A git clone of the same repo as in my initial
tests:

  .cloning -> 10m04
  loose-object-cache -> 9m59

Jeff's patch does a little more work (256 readdir() calls, which in case of an
empty repo translate into 256 LOOKUP calls that return NFS4ERR_NOENT) but that
appears to be insignificant.

I agree that the loose-object-cache approach is preferred as it applies to more
git commands and also benefits performance when there are loose objects
already in the repository.

As pointed out in the thread, maybe the default cache size should be some
integer times gc.auto.

I believe the loose-object-cache approach would have a performance regression
when you're receiving a small pack file and there's many loose objects in the
repo. Basically you're trading off

    MIN(256, num_objects_in_pack / dentries_per_readdir) * readdir_latency
    
against

    num_loose_objects * stat_latency

On Amazon EFS (and I expect on other NFS server implementations too) it is more
efficient to do readdir() on a large directory than to stat() each of the
individual files in the same directory. I don't have exact numbers but based on
a very rough calculation the difference is roughly 10x for large directories
under normal circumstances.

As an example, this means that when you're receiving a pack file with 1K
objects in a repository with 10K loose objects that the loose-object-cache
patch has roughly the same performance as the current git. I'm not sure if this
is something to worry about as I'm not sure people run repos with this many
loose files. If it is a concern, there could be a flag to turn the loose object
cache on/off.
Jeff King Oct. 29, 2018, 9:50 p.m. UTC | #15
On Mon, Oct 29, 2018 at 09:34:53PM +0000, Geert Jansen wrote:

> On Mon, Oct 29, 2018 at 09:48:02AM +0900, Junio C Hamano wrote:
> 
> > A real question is how much performance gain, relative to ".cloning"
> > thing, this approach gives us.  If it gives us 80% or more of the
> > gain compared to doing no checking, I'd say we have a clear winner.
> 
> I've tested Jeff's loose-object-cache patch and the performance is within error
> bounds of my .cloning patch. A git clone of the same repo as in my initial
> tests:
> 
>   .cloning -> 10m04
>   loose-object-cache -> 9m59
> 
> Jeff's patch does a little more work (256 readdir() calls, which in case of an
> empty repo translate into 256 LOOKUP calls that return NFS4ERR_NOENT) but that
> appears to be insignificant.

Yep, that makes sense. Thanks for timing it.

> I believe the loose-object-cache approach would have a performance regression
> when you're receiving a small pack file and there's many loose objects in the
> repo. Basically you're trading off
> 
>     MIN(256, num_objects_in_pack / dentries_per_readdir) * readdir_latency
>     
> against
> 
>     num_loose_objects * stat_latency

Should num_loose_objects and num_objects_in_pack be swapped here? Just
making sure I understand what you're saying.

The patch I showed just blindly reads each of the 256 object
subdirectories. I think if we pursue this (and it seems like everybody
is on board), we should cache each of those individually. So a single
object would incur at most one opendir/readdir (and subsequent objects
may, too, or they may hit that cache if they share the first byte).

So the 256 in your MIN() is potentially much smaller. We still have to
deal with the fact that if you have a large number of loose objects,
they may be split across multiple readdir (or getdents) calls. The "cache
maximum" we discussed does bound that, but in some ways that's worse:
you waste time doing the bounded amount of readdir and then don't even
get the benefit of the cache. ;)

> On Amazon EFS (and I expect on other NFS server implementations too) it is more
> efficient to do readdir() on a large directory than to stat() each of the
> individual files in the same directory. I don't have exact numbers but based on
> a very rough calculation the difference is roughly 10x for large directories
> under normal circumstances.

I'd expect readdir() to be much faster than stat() in general (e.g., "ls
-f" versus "ls -l" is faster even on a warm cache; there's more
formatting going on in the latter, but I think a lot of it is the effort
to stat).

-Peff
Geert Jansen Oct. 29, 2018, 10:21 p.m. UTC | #16
On Mon, Oct 29, 2018 at 05:50:50PM -0400, Jeff King wrote:

> > I believe the loose-object-cache approach would have a performance regression
> > when you're receiving a small pack file and there's many loose objects in the
> > repo. Basically you're trading off
> > 
> >     MIN(256, num_objects_in_pack / dentries_per_readdir) * readdir_latency
> >     
> > against
> > 
> >     num_loose_objects * stat_latency
> 
> Should num_loose_objects and num_objects_in_pack be swapped here? Just
> making sure I understand what you're saying.

Whoops, yes, thanks for spotting that!

> So the 256 in your MIN() is potentially much smaller. We still have to
> deal with the fact that if you have a large number of loose objects,
> they may be split cross multiple readdir (or getdents) calls. The "cache
> maximum" we discussed does bound that, but in some ways that's worse:
> you waste time doing the bounded amount of readdir and then don't even
> get the benefit of the cache. ;)

Yup. To get the performance benefit you'd like the cache to hold all loose
objects except in clearly degenerate cases with far too many loose objects.

> > On Amazon EFS (and I expect on other NFS server implementations too) it is more
> > efficient to do readdir() on a large directory than to stat() each of the
> > individual files in the same directory. I don't have exact numbers but based on
> > a very rough calculation the difference is roughly 10x for large directories
> > under normal circumstances.
> 
> I'd expect readdir() to be much faster than stat() in general (e.g., "ls
> -f" versus "ls -l" is faster even on a warm cache; there's more
> formatting going on in the latter, but I think a lot of it is the effort
> to stat).

In the case of NFS, the client usually requests that the READDIR response also
contains some of the stat flags (like st_mode). But even in this case it's
still more efficient to return multiple entries in one batch through READDIR
rather than as individual responses to GETATTR (which is what stat() maps to).
Jeff King Oct. 29, 2018, 10:27 p.m. UTC | #17
On Mon, Oct 29, 2018 at 09:34:53PM +0000, Geert Jansen wrote:

> As an example, this means that when you're receiving a pack file with 1K
> objects in a repository with 10K loose objects that the loose-object-cache
> patch has roughly the same performance as the current git. I'm not sure if this
> is something to worry about as I'm not sure people run repos with this many
> loose files. If it is a concern, there could be a flag to turn the loose object
> cache on/off.

So yeah, that's the other thing I'm thinking about regarding having a
maximum loose cache size.

10k objects is only 200KB in memory. That's basically nothing. At some
point you run into pathological cases, like having a million objects
(but that's still only 20MB, much less than we devote to other caches,
though of course they do add up).

If you have a million loose objects, I strongly suspect you're going to
run into other problems (like space, since you're not getting any
deltas).

The one thing that gives me pause is that if you have a bunch of unused
and unreachable loose objects on disk, most operations won't actually
look at them at all. The majority of operations are only looking for
objects we expect to be present (e.g., resolving a ref, walking a tree)
and are fulfilled by checking the pack indices first.

So it's possible that Git is _tolerable_ for most operations with a
million loose objects, and we could make it slightly worse by loading
the cache. But I find it hard to get too worked up about spending an
extra 20MB (and the time to readdir() it in) in that case. It seems like
about 400ms on my machine, and the correct next step is almost always
going to be "pack" or "prune" anyway.

-Peff
Stefan Beller Oct. 29, 2018, 10:35 p.m. UTC | #18
On Mon, Oct 29, 2018 at 3:27 PM Jeff King <peff@peff.net> wrote:

> So yeah, that's the other thing I'm thinking about regarding having a
> maximum loose cache size.

tangent:
This preloading/caching could be used for a more precise approach
to decide when to gc instead of using some statistical sampling
of objects/17, eventually.
Jeff King Oct. 29, 2018, 11:27 p.m. UTC | #19
On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > Yeah, especially given recent advances in SHA-1 attacks, I'm not super
> > comfortable with the idea of disabling the duplicate-object check at
> > this point.
> 
> I'd be comfortable with it in my setup since it's been limited to
> collision attacks that are computationally prohibitive, and there being
> no sign of preimage attacks, which is the case we really need to worry
> about.

I agree, and I'm not actually that worried about the current state. But
what makes me more nervous is the life-cycle around Git. In 5 years,
people are still going to be running what we ship today, and will
grumble about upgrading to deal with SHA-1.

I suppose it's not the end of the world as long as they can un-flip a
config switch to get back the more-paranoid behavior (which is all that
you're really proposing).

> It does introduce a race condition where you can introduce a colliding
> object to the repository by doing two concurrent pushes, but as you note
> in
> https://public-inbox.org/git/20181029151842.GJ17668@sigill.intra.peff.net/
> this already applies to packs, so you can trigger that with the right
> sized push (depending on transfer.unpackLimit), and we also have this in
> existing forms for other stuff.

Right. It can also trigger currently if somebody runs "git repack"
simultaneously (the loose becomes packed, but we don't re-scan the pack
directory).

> I do think it's amazingly paranoid to be worried about SHA-1 collisions
> in the first place, and a bit odd to leave the door open on these race
> conditions. I.e. it's hard to imagine a state-level[1] actor with
> sufficient motivation to exploit this who wouldn't find some way to make
> the race condition work as an escape hatch.

Yeah, I agree there's an element of that. I think the "push twice
quickly to race" thing is actually not all that interesting, though. In
that case, you're providing both the objects already, so why not just
push the one you want?

What's more interesting is racing with the victim of your collision (I
feed Junio the good half of the collision, and then try to race his
push and get my evil half in at the same time). Or racing a repack. But
timing the race there seems a lot trickier.

I suspect you could open up the window substantially by feeding your
pack really slowly. So I start to push at 1pm, but trickle in a byte at
a time of my 1GB pack, taking several hours. Meanwhile Junio pushes, and
then as soon as I see that, I send the rest of my pack. My index-pack
doesn't see Junio's push because it started before.

And ditto with repack, if the server runs it predictably in response to
load.  So maybe not so tricky after all.

I think the other thing that helps here is that _everybody_ runs the
collision check. So yeah, you can race pushing your evil stuff to my
server. But it only takes one person fetching into their quiescent
laptop repository to notice the collision and sound the alarm.

I'll admit that there's a whole lot of hand-waving there, for a security
claim. I'll be glad to simply move off of SHA-1.

> In a busy repo that gets a lot of branches / branch deletions (so not
> quite as extreme as [2], but close) and the default expiry policy you
> can easily have 20-100K loose objects (something near the lower bound of
> that is the current live state of one server I'm looking at).
> 
> A recursive opendir()/readdir() on that on local disk is really fast if
> it's in cache, but can easily be 1-5 seconds on NFS. So for a push we'd
> now pay up to 5s just populating a cache we'll barely use to accept some
> tiny push with just a few objects.

That 1-5 seconds is a little scary. Locally for a million objects I was
looking at 400ms. But obviously NFS is going to be much worse.

I do agree with your sentiment below that even if this should be on by
default, it should have a config knob. After all, "please flip this
switch and see if things improve" is a good escape hatch to have.

>  * Re-roll my 4 patch series to include the patch you have in
>    <20181027093300.GA23974@sigill.intra.peff.net>

I don't think it's quite ready for inclusion as-is. I hope to brush it
up a bit, but I have quite a backlog of stuff to review, as well.

-Peff
Jeff King Oct. 29, 2018, 11:29 p.m. UTC | #20
On Mon, Oct 29, 2018 at 03:35:58PM -0700, Stefan Beller wrote:

> On Mon, Oct 29, 2018 at 3:27 PM Jeff King <peff@peff.net> wrote:
> 
> > So yeah, that's the other thing I'm thinking about regarding having a
> > maximum loose cache size.
> 
> tangent:
> This preloading/caching could be used for a more precise approach
> to decide when to gc instead of using some statistical sampling
> of objects/17, eventually.

Isn't it exactly the same thing? Ideally we'd break down the cache to
the directory level, so we could fill the cache list for "17" and ask
"how full are you". Or we could just readdir objects/17 ourselves. But
either way, it's the same amount of work.
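
The objects/17 sampling being referred to can be sketched roughly like this
(illustrative Python, not git's actual C implementation; the function name
and demo setup are made up):

```python
# Rough sketch of the objects/17 sampling heuristic: count the entries in
# a single fan-out directory and extrapolate by 256, since SHA-1s spread
# uniformly over the objects/00..objects/ff buckets.
import os
import tempfile

def estimate_loose_objects(objdir, sample="17"):
    """Count one fan-out directory's entries and extrapolate by 256."""
    try:
        sampled = len(os.listdir(os.path.join(objdir, sample)))
    except FileNotFoundError:
        sampled = 0
    return sampled * 256

# demo: a throwaway object directory with 4 objects in the sampled bucket
objdir = tempfile.mkdtemp()
os.makedirs(os.path.join(objdir, "17"))
for name in ("aa", "bb", "cc", "dd"):
    open(os.path.join(objdir, "17", name), "w").close()
print(estimate_loose_objects(objdir))  # → 1024
```

Filling the cache for one bucket and asking "how full are you" really is the
same single readdir() as this sampling, hence "the same amount of work".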

-Peff
Geert Jansen Nov. 7, 2018, 10:55 p.m. UTC | #21
On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:

> On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
> >  * Re-roll my 4 patch series to include the patch you have in
> >    <20181027093300.GA23974@sigill.intra.peff.net>
> 
> I don't think it's quite ready for inclusion as-is. I hope to brush it
> up a bit, but I have quite a backlog of stuff to review, as well.

We're still quite keen to get this patch included. Is there anything I can do
to help?

Also I just re-read your comments on maximum cache size. I think you were
arguing both sides of the equation and I wasn't sure where you'd ended up. :)
A larger cache size potentially takes more time to fill up especially on NFS
while a smaller cache size would obviously be less effective. That said a small
cache is still effective for the "clone" case where the repo is empty.

It also occurred to me that as a performance optimization your patch could read
the loose object directories in parallel using a thread pool. At least on
Amazon EFS this should result in an almost linear performance increase. I'm not
sure how much this would help for local file systems. In any case this may be
best done as a follow-up patch (that I'd be happy to volunteer for).
Jeff King Nov. 8, 2018, 12:02 p.m. UTC | #22
On Wed, Nov 07, 2018 at 10:55:24PM +0000, Geert Jansen wrote:

> On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:
> 
> > On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
> > >  * Re-roll my 4 patch series to include the patch you have in
> > >    <20181027093300.GA23974@sigill.intra.peff.net>
> > 
> > I don't think it's quite ready for inclusion as-is. I hope to brush it
> > up a bit, but I have quite a backlog of stuff to review, as well.
> 
> We're still quite keen to get this patch included. Is there anything I can do
> to help?

Yes, testing and review. :)

I won't send the series out just yet, as I suspect it could use another
read-through on my part. But if you want to peek at it or try some
timings, it's available at:

  https://github.com/peff/git jk/loose-cache

It's quite a bit bigger than the original patch, as some refactoring was
necessary to reuse the existing cache in alternate_object_directories.
I'm rather pleased with how it turned out; it unifies the handling of
alternates and the main object directory, which is a cleanup I've been
wanting to do for some time.

> Also I just re-read your comments on maximum cache size. I think you were
> arguing both sides of the equation and I wasn't sure where you'd ended up. :)
> A larger cache size potentially takes more time to fill up especially on NFS
> while a smaller cache size would obviously be less effective. That said a small
> cache is still effective for the "clone" case where the repo is empty.

I ended up thinking that a large cache is going to be fine. So I didn't
even bother implementing a limit in my series, which makes things a bit
simpler (it's one less state to deal with).

Since it reuses the existing cache code, it's better in a few ways than
my earlier patch:

  1. If a program uses OBJECT_INFO_QUICK and prints abbreviated sha1s,
     we only have to load the cache once (I think fetch does this, but I
     didn't test it).

  2. The cache is filled one directory at a time, which avoids
     unnecessary work when there are only a few lookups.

  3. The cache is per-object-directory. So if a request can be filled
     without looking at an alternate, we avoid looking at the alternate.
     I doubt this matters much in practice (the case we care about is
     when we _don't_ have the object, and there you have to look
     everywhere).
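
As a rough illustration of points 2 and 3 (class and method names are
assumptions for the sketch, not git's actual internals), a per-directory
lazily-filled loose-object cache looks something like:

```python
# Minimal sketch of a lazily-filled loose-object cache: one readdir() per
# objects/XX bucket, paid only the first time a lookup touches that bucket.
import os
import tempfile

class LooseObjectCache:
    def __init__(self, objdir):
        self.objdir = objdir
        self.buckets = {}             # "00".."ff" -> set of full hex oids

    def _fill(self, subdir):
        oids = set()
        try:
            for entry in os.listdir(os.path.join(self.objdir, subdir)):
                oids.add(subdir + entry)
        except FileNotFoundError:
            pass                      # cache negative results, too
        self.buckets[subdir] = oids

    def has_loose(self, oid):
        subdir = oid[:2]
        if subdir not in self.buckets:    # lazy: fill only this bucket
            self._fill(subdir)
        return oid in self.buckets[subdir]

# demo: one loose object in bucket "ab" of a throwaway object directory
objdir = tempfile.mkdtemp()
os.makedirs(os.path.join(objdir, "ab"))
open(os.path.join(objdir, "ab", "c" * 38), "w").close()
cache = LooseObjectCache(objdir)
print(cache.has_loose("ab" + "c" * 38))  # → True
print(cache.has_loose("00" + "d" * 38))  # → False
```

The first lookup in an untouched bucket costs one readdir(); later lookups in
the same bucket are pure in-memory hits, which is what makes repeated
OBJECT_INFO_QUICK checks cheap.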

The one thing I didn't implement is a config option to disable this.
That would be pretty easy to add. I don't think it's necessary, but it
would make testing before/after behavior easier if somebody thinks it's
slowing down their particular case.

> It also occurred to me that as a performance optimization your patch could read
> the loose object directories in parallel using a thread pool. At least on
> Amazon EFS this should result in an almost linear performance increase. I'm not
> sure how much this would help for local file systems. In any case this may be
> best done as a follow-up patch (that I'd be happy to volunteer for).

Yeah, I suspect it could make things faster in some cases. But it also
implies filling all of the cache directories at once up front. The code
I have now tries to avoid unnecessary cache fills. But it would be
pretty easy to kick off a full fill.

I agree it would make more sense as a follow-up patch (and probably
controlled by a config option, since it likely only makes sense when you
have a really high-latency readdir).
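
A sketch of what such a full up-front fill could look like (illustrative only,
not code from the series; the helper names are made up):

```python
# Sketch of a parallel full cache fill: 256 readdir() calls spread over a
# thread pool. This only pays off when per-readdir latency is high (e.g.
# NFS); on a local FS the extra threads mostly add overhead.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def fill_bucket(objdir, subdir):
    try:
        entries = os.listdir(os.path.join(objdir, subdir))
    except FileNotFoundError:
        entries = []
    return subdir, {subdir + e for e in entries}

def fill_all(objdir, workers=16):
    """Fill all 256 buckets concurrently; returns subdir -> set of oids."""
    subdirs = ["%02x" % i for i in range(256)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda s: fill_bucket(objdir, s), subdirs))

# demo: one loose object in bucket "17" of a throwaway object directory
objdir = tempfile.mkdtemp()
os.makedirs(os.path.join(objdir, "17"))
open(os.path.join(objdir, "17", "x" * 38), "w").close()
cache = fill_all(objdir)
print(len(cache))  # → 256
```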

-Peff
Geert Jansen Nov. 8, 2018, 8:58 p.m. UTC | #23
On Thu, Nov 08, 2018 at 07:02:57AM -0500, Jeff King wrote:

> Yes, testing and review. :)
> 
> I won't send the series out just yet, as I suspect it could use another
> read-through on my part. But if you want to peek at it or try some
> timings, it's available at:
> 
>   https://github.com/peff/git jk/loose-cache

I gave this branch a go. There's a performance regression as I'm getting a
clone speed of about 100 KiB/s while with the previous patch I got around 20
MiB/s. The culprit appears to be a very large number of stat() calls on
".git/objects/info/alternates". The call stack is:

 -> quick_has_loose()
 -> prepare_alt_odb()
 -> read_info_alternates()
Jeff King Nov. 8, 2018, 9:18 p.m. UTC | #24
On Thu, Nov 08, 2018 at 08:58:19PM +0000, Geert Jansen wrote:

> On Thu, Nov 08, 2018 at 07:02:57AM -0500, Jeff King wrote:
> 
> > Yes, testing and review. :)
> > 
> > I won't send the series out just yet, as I suspect it could use another
> > read-through on my part. But if you want to peek at it or try some
> > timings, it's available at:
> > 
> >   https://github.com/peff/git jk/loose-cache
> 
> I gave this branch a go. There's a performance regression as I'm getting a
> clone speed of about 100 KiB/s while with the previous patch I got around 20
> MiB/s. The culprit appears to be a very large number of stat() calls on
> ".git/objects/info/alternates". The call stack is:
> 
>  -> quick_has_loose()
>  -> prepare_alt_odb()
>  -> read_info_alternates()

Heh, indeed. Try this on top:

diff --git a/sha1-file.c b/sha1-file.c
index bc35b28e17..9ff27f92ed 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -692,6 +692,7 @@ void prepare_alt_odb(struct repository *r)
 	link_alt_odb_entries(r, r->objects->alternate_db, PATH_SEP, NULL, 0);
 
 	read_info_alternates(r, r->objects->odb->path, 0);
+	r->objects->loaded_alternates = 1;
 }
 
 /* Returns 1 if we have successfully freshened the file, 0 otherwise. */

-Peff
Geert Jansen Nov. 8, 2018, 9:55 p.m. UTC | #25
On Thu, Nov 08, 2018 at 04:18:24PM -0500, Jeff King wrote:

> Heh, indeed. Try this on top:
> 
> diff --git a/sha1-file.c b/sha1-file.c
> index bc35b28e17..9ff27f92ed 100644
> --- a/sha1-file.c
> +++ b/sha1-file.c
> @@ -692,6 +692,7 @@ void prepare_alt_odb(struct repository *r)
>  	link_alt_odb_entries(r, r->objects->alternate_db, PATH_SEP, NULL, 0);
>  
>  	read_info_alternates(r, r->objects->odb->path, 0);
> +	r->objects->loaded_alternates = 1;
>  }
>  
>  /* Returns 1 if we have successfully freshened the file, 0 otherwise. */

Thanks, this did it. Performance is now back at the level of the previous patch.
Ævar Arnfjörð Bjarmason Nov. 8, 2018, 10:20 p.m. UTC | #26
On Thu, Nov 08 2018, Jeff King wrote:

> On Wed, Nov 07, 2018 at 10:55:24PM +0000, Geert Jansen wrote:
>
>> On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:
>>
>> > On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> > >  * Re-roll my 4 patch series to include the patch you have in
>> > >    <20181027093300.GA23974@sigill.intra.peff.net>
>> >
>> > I don't think it's quite ready for inclusion as-is. I hope to brush it
>> > up a bit, but I have quite a backlog of stuff to review, as well.
>>
>> We're still quite keen to get this patch included. Is there anything I can do
>> to help?
>
> Yes, testing and review. :)
>
> I won't send the series out just yet, as I suspect it could use another
> read-through on my part. But if you want to peek at it or try some
> timings, it's available at:
>
>   https://github.com/peff/git jk/loose-cache

Just a comment on this from the series:

    Note that it is possible for this to actually be _slower_. We'll do a
    full readdir() to fill the cache, so if you have a very large number of
    loose objects and a very small number of lookups, that readdir() may end
    up more expensive.

    In practice, though, having a large number of loose objects is already a
    performance problem, which should be fixed by repacking or pruning via
    git-gc. So on balance, this should be a good tradeoff.

Our biggest repo has a very large number of loose objects at any given
time, but the vast majority of these are because gc *is* happening very
frequently and the default expiry policy of 2wks is in effect.

Having a large number of loose objects is not per-se a performance
problem.

It's a problem if you end up "faulting" from packs to the loose
object directory a lot because those objects are still reachable, but if
they're not reachable that number can grow very large if your ref churn
is large (so lots of expired loose object production).

Anyway, the series per-se looks good to me. It's particularly nice to
have some of the ODB cleanup + cleanup in fetch-pack.c

Just wanted to note that in our default (reasonable) config we do
produce scenarios where this change can still be somewhat pathological,
so I'm still interested in disabling it entirely given the
implausibility of what it's guarding against.
Ævar Arnfjörð Bjarmason Nov. 9, 2018, 10:11 a.m. UTC | #27
On Thu, Nov 08 2018, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Nov 08 2018, Jeff King wrote:
>
>> On Wed, Nov 07, 2018 at 10:55:24PM +0000, Geert Jansen wrote:
>>
>>> On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:
>>>
>>> > On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>> > >  * Re-roll my 4 patch series to include the patch you have in
>>> > >    <20181027093300.GA23974@sigill.intra.peff.net>
>>> >
>>> > I don't think it's quite ready for inclusion as-is. I hope to brush it
>>> > up a bit, but I have quite a backlog of stuff to review, as well.
>>>
>>> We're still quite keen to get this patch included. Is there anything I can do
>>> to help?
>>
>> Yes, testing and review. :)
>>
>> I won't send the series out just yet, as I suspect it could use another
>> read-through on my part. But if you want to peek at it or try some
>> timings, it's available at:
>>
>>   https://github.com/peff/git jk/loose-cache
>
> Just a comment on this from the series:
>
>     Note that it is possible for this to actually be _slower_. We'll do a
>     full readdir() to fill the cache, so if you have a very large number of
>     loose objects and a very small number of lookups, that readdir() may end
>     up more expensive.
>
>     In practice, though, having a large number of loose objects is already a
>     performance problem, which should be fixed by repacking or pruning via
>     git-gc. So on balance, this should be a good tradeoff.
>
> Our biggest repo has a very large number of loose objects at any given
> time, but the vast majority of these are because gc *is* happening very
> frequently and the default expiry policy of 2wks is in effect.
>
> Having a large number of loose objects is not per-se a performance
> problem.
>
> It's a problem if you end up "faulting" from packs to the loose
> object directory a lot because those objects are still reachable, but if
> they're not reachable that number can grow very large if your ref churn
> is large (so lots of expired loose object production).
>
> Anyway, the series per-se looks good to me. It's particularly nice to
> have some of the ODB cleanup + cleanup in fetch-pack.c
>
> Just wanted to note that in our default (reasonable) config we do
> produce scenarios where this change can still be somewhat pathological,
> so I'm still interested in disabling it entirely given the
> implausibility of what it's guarding against.

Some actual numbers for this for a fairly small repo on NFS, "cold"
cache (hadn't run it in a bit):

    $ time (find objects/?? -type f|wc -l)
    862
    real    0m1.927s

Warm cache:

    $ time (find objects/?? -type f|wc -l)
    872
    real    0m0.151s

Cold cache on a bigger monorepo:

    $ time (find objects/?? -type f|wc -l)
    real    0m4.336s

Warm cache on a bigger monorepo (more ref churn):

    $ time (find objects/?? -type f|wc -l)
    49746
    real    0m1.082s

This on a server where bulk sustained writes of large files are really
fast (up to 1GB/s). It's just these metadata ops that are slow.

I also get cold cache times of up to 6 seconds on:

    time (find $(ls -d objects/??|sort -R) -type f | wc -l)

As opposed max of ~4s without -R, so I suspect there may be some
client/server optimization where things are iterated over in recursive
glob order (pre-fetched?), whereas the cache will try to fill buckets as
it encounters loose objects, so it iterates over objects/{00..ff} randomly.

I'm not really leading up to any point here I haven't made already. I
was just curious to try to find some upper bound of overhead if say a
pack with 512 objects is pushed. In that case it's very likely that we
need to fill at least 200/256 buckets.
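
That estimate checks out with a standard balls-into-bins expectation (nothing
from git itself, just arithmetic):

```python
# Expected number of distinct objects/XX buckets touched by n objects
# hashed uniformly into d=256 fan-out directories: d * (1 - (1 - 1/d)**n).
def expected_buckets(n, d=256):
    return d * (1 - (1 - 1 / d) ** n)

print(round(expected_buckets(512)))  # → 221
```

So a 512-object pack touches about 221 of the 256 buckets on average, in line
with the "at least 200/256" figure above.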
Ævar Arnfjörð Bjarmason Nov. 9, 2018, 1:43 p.m. UTC | #28
On Wed, Nov 07 2018, Geert Jansen wrote:

> On Mon, Oct 29, 2018 at 07:27:39PM -0400, Jeff King wrote:
>
>> On Mon, Oct 29, 2018 at 08:36:07PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> >  * Re-roll my 4 patch series to include the patch you have in
>> >    <20181027093300.GA23974@sigill.intra.peff.net>
>>
>> I don't think it's quite ready for inclusion as-is. I hope to brush it
>> up a bit, but I have quite a backlog of stuff to review, as well.
>
> We're still quite keen to get this patch included. Is there anything I can do
> to help?
>
> Also I just re-read your comments on maximum cache size. I think you were
> arguing both sides of the equation and I wasn't sure where you'd ended up. :)
> A larger cache size potentially takes more time to fill up especially on NFS
> while a smaller cache size would obviously be less effective. That said a small
> cache is still effective for the "clone" case where the repo is empty.
>
> It also occurred to me that as a performance optimization your patch could read
> the loose object directories in parallel using a thread pool. At least on
> Amazon EFS this should result in an almost linear performance increase. I'm not
> sure how much this would help for local file systems. In any case this may be
> best done as a follow-up patch (that I'd be happy to volunteer for).

I'm planning to re-submit mine with some minor changes after the great
Documentation/config* move lands.

As noted in
https://public-inbox.org/git/87bm7clf4o.fsf@evledraar.gmail.com/ and
https://public-inbox.org/git/87h8gq5zmc.fsf@evledraar.gmail.com/ I think
it's worth it regardless of Jeff's optimization. O(nothing) is always faster
than O(something), particularly (as explained in that E-Mail) on NFS.

You didn't answer my question in
https://public-inbox.org/git/20181030024925.GC8325@amazon.com/ about
whether for your purposes you're interested in this for something where
it needs to work out of the box on some random Amazon's customer's
"git", or if it's something in-house and you just don't want to turn off
collision checking. That would be useful to know.

I've turned on core.checkCollisions=false in production at our
site. Cloning for some large repositories went from ~200 minutes to ~5m,
and some pushes from ~5 minutes to ~10 seconds. Those numbers will be
very similar (but slightly higher, maybe 1-5 seconds higher in the
latter case) with Jeff's (depending on the push).
Duy Nguyen Nov. 9, 2018, 4:08 p.m. UTC | #29
On Fri, Nov 9, 2018 at 2:46 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> I'm planning to re-submit mine with some minor changes after the great
> Documentation/config* move lands.
>
> As noted in
> https://public-inbox.org/git/87bm7clf4o.fsf@evledraar.gmail.com/ and
> https://public-inbox.org/git/87h8gq5zmc.fsf@evledraar.gmail.com/ I think
> it's worth it regardless of Jeff's optimization. O(nothing) is always faster
> than O(something), particularly (as explained in that E-Mail) on NFS.

Is it really worth adding more code to maintain just to shave a couple
seconds (or a few percent clone time)?
Ævar Arnfjörð Bjarmason Nov. 10, 2018, 2:04 p.m. UTC | #30
On Fri, Nov 09 2018, Duy Nguyen wrote:

> On Fri, Nov 9, 2018 at 2:46 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>> I'm planning to re-submit mine with some minor changes after the great
>> Documentation/config* move lands.
>>
>> As noted in
>> https://public-inbox.org/git/87bm7clf4o.fsf@evledraar.gmail.com/ and
>> https://public-inbox.org/git/87h8gq5zmc.fsf@evledraar.gmail.com/ I think
>> it's worth it regardless of Jeff's optimization. O(nothing) is always faster
>> than O(something), particularly (as explained in that E-Mail) on NFS.
>
> Is it really worth adding more code to maintain just to shave a couple
> seconds (or a few percent clone time)?

Yeah I think so, because (in rough priority order):

a) The maintenance burden of carrying core.checkCollisions is trivial,
   and it's hard to imagine a scenario where it'll be difficult to
   selectively turn off some does_this_collide() function.

b) I think I need to worry more about a meteorite colliding with the
   datacenter than the threat this check is trying to guard against.

c) I think we should just turn it off by default on SHA-1, but don't
   expect that argument will carry the day. But I expect even those who
   think we still need it will have a hard time making that argument in
   the case of SHA-256. So having the codepath to disable it is helpful.

d) As shown in the linked E-Mails of mine you sometimes pay a 2-3 second
   *fixed* cost even for a very small (think ~100-200 objects) push/fetch
   that would otherwise take milliseconds with Jeff's version of this
   optimization (and not with mine). This can be a hundred/thousands of
   percent slowdown.

   Is that a big deal in itself in terms of absolute time spent? No. But
   I'm also thinking about this from the perspective of getting noise
   out of performance metrics. Some of this slowdown is also "user
   waiting for the terminal to be usable again" not just some machine
   somewhere wasting its own time.

e) As shown in the patch I have this direction as a very beneficial
   side-effect makes it much easier to repair corrupt
   repositories. Something I'm hoping to pursue even further. I've had
   cases where core.checkCollisions=false + stuff on top would have made
   repairing a broken repo much easier.

Anyway, I'm in no rush to send my patch. I'm happily using it in
production, but will wait for Jeff's be ready and to land before picking
it up again. Just wanted to do a braindump of the benefits.
Jeff King Nov. 12, 2018, 2:31 p.m. UTC | #31
On Thu, Nov 08, 2018 at 11:20:47PM +0100, Ævar Arnfjörð Bjarmason wrote:

> Just a comment on this from the series:
> 
>     Note that it is possible for this to actually be _slower_. We'll do a
>     full readdir() to fill the cache, so if you have a very large number of
>     loose objects and a very small number of lookups, that readdir() may end
>     up more expensive.
> 
>     In practice, though, having a large number of loose objects is already a
>     performance problem, which should be fixed by repacking or pruning via
>     git-gc. So on balance, this should be a good tradeoff.
> 
> Our biggest repo has a very large number of loose objects at any given
> time, but the vast majority of these are because gc *is* happening very
> frequently and the default expiry policy of 2wks is in effect.
> 
> Having a large number of loose objects is not per-se a performance
> problem.

Yes, you're right. I was trying not to get into the rabbit hole of
discussing theoretical tradeoffs, but it is worth addressing. I've
updated that commit message in the patches I'll send out momentarily.

-Peff
Jeff King Nov. 12, 2018, 2:34 p.m. UTC | #32
On Sat, Nov 10, 2018 at 03:04:35PM +0100, Ævar Arnfjörð Bjarmason wrote:

> d) As shown in the linked E-Mails of mine you sometimes pay a 2-3 second
>    *fixed* cost even for a very small (think ~100-200 objects) push/fetch
>    that would otherwise take milliseconds with Jeff's version of this
>    optimization (and not with mine). This can be a hundred/thousands of
>    percent slowdown.
> 
>    Is that a big deal in itself in terms of absolute time spent? No. But
>    I'm also thinking about this from the perspective of getting noise
>    out of performance metrics. Some of this slowdown is also "user
>    waiting for the terminal to be usable again" not just some machine
>    somewhere wasting its own time.

IMHO the ultimate end-game in this direction is still "don't have a
bunch of loose objects".

Right now this can legitimately happen due to unreachable-but-recent
objects being exploded out (or never packed in the first place). But I
hope in the long run that we'll actually put these into packs. That will
make this case faster _and_ avoid extra work during gc _and_ fix the
"whoops, we just ran gc but you still have a lot of objects" problem.

Which doesn't invalidate your other four points, of course. ;)

-Peff
Geert Jansen Nov. 12, 2018, 10:58 p.m. UTC | #33
On Fri, Nov 09, 2018 at 02:43:52PM +0100, Ævar Arnfjörð Bjarmason wrote:

> As noted in
> https://public-inbox.org/git/87bm7clf4o.fsf@evledraar.gmail.com/ and
> https://public-inbox.org/git/87h8gq5zmc.fsf@evledraar.gmail.com/ I think
> it's regardless of Jeff's optimization is. O(nothing) is always faster
> than O(something), particularly (as explained in that E-Mail) on NFS.
> 
> You didn't answer my question in
> https://public-inbox.org/git/20181030024925.GC8325@amazon.com/ about
> whether for your purposes you're interested in this for something where
> it needs to work out of the box on some random Amazon's customer's
> "git", or if it's something in-house and you just don't want to turn off
> collision checking. That would be useful to know.

The reason I started this thread is to optimize performance for AWS customers
that run git on EFS. Therefore, my preference is that git would be fast out of
the box on NFS/EFS but without having to disable collision checking
unconditionally (disabling it for empty repos is fine as that's a no-op
anyway).
diff mbox series

Patch

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 2004e25da..22b3d40fb 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -84,6 +84,7 @@  static int verbose;
 static int show_resolving_progress;
 static int show_stat;
 static int check_self_contained_and_connected;
+static int cloning;
 
 static struct progress *progress;
 
@@ -794,7 +795,7 @@  static void sha1_object(const void *data, struct object_entry *obj_entry,
 
 	assert(data || obj_entry);
 
-	if (startup_info->have_repository) {
+	if (startup_info->have_repository && !cloning) {
 		read_lock();
 		collision_test_needed =
 			has_sha1_file_with_flags(oid->hash, OBJECT_INFO_QUICK);
@@ -1705,6 +1706,8 @@  int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				check_self_contained_and_connected = 1;
 			} else if (!strcmp(arg, "--fsck-objects")) {
 				do_fsck_object = 1;
+			} else if (!strcmp(arg, "--cloning")) {
+				cloning = 1;
 			} else if (!strcmp(arg, "--verify")) {
 				verify = 1;
 			} else if (!strcmp(arg, "--verify-stat")) {
diff --git a/fetch-pack.c b/fetch-pack.c
index b3ed7121b..c75bfb8aa 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -843,6 +843,8 @@  static int get_pack(struct fetch_pack_args *args,
 			argv_array_push(&cmd.args, "--check-self-contained-and-connected");
 		if (args->from_promisor)
 			argv_array_push(&cmd.args, "--promisor");
+		if (args->cloning)
+			argv_array_push(&cmd.args, "--cloning");
 	}
 	else {
 		cmd_name = "unpack-objects";