Message ID | 20190511013455.5886-1-e@80x24.org (mailing list archive)
---|---
State | New, archived
Series | update-server-info: avoid needless overwrites
On Fri, May 10, 2019 at 9:35 PM Eric Wong <e@80x24.org> wrote:
> Do not change the existing info/refs and objects/info/packs files if they match the existing content on the filesystem. This is intended to preserve mtime and make it easier for dumb HTTP pollers to rely on the If-Modified-Since header.
> [...]
> Signed-off-by: Eric Wong <e@80x24.org>
> ---
> diff --git a/server-info.c b/server-info.c
> @@ -6,37 +6,78 @@
> +static int files_differ(FILE *fp, const char *path)
> +{
> +	[...]
> +	struct strbuf tmp = STRBUF_INIT;
> +	[...]
> +	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
> +		return 1;

Although strbuf_fread() will release 'tmp' automatically if the read actually fails, it won't do so otherwise. So, if it reads more or fewer bytes than expected, for some reason, then this early return will leak the strbuf, won't it?

> +	the_hash_algo->init_fn(&c);
> +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> +	the_hash_algo->final_fn(oid_new.hash, &c);
> +	strbuf_release(&tmp);
> +
> +	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
> +		return 1;

Same issue.

> diff --git a/t/t5200-update-server-info.sh b/t/t5200-update-server-info.sh
> @@ -0,0 +1,41 @@
> +test_description='Test git stash show configuration.'

Copy/paste error for test description?
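To illustrate the point, a leak-free restructuring of those early returns might look like this. This is only a sketch, not code from the thread; it assumes the same declarations and helpers server-info.c already uses in the patch:

static int files_differ(FILE *fp, const char *path)
{
	struct stat st;
	git_hash_ctx c;
	struct object_id oid_old, oid_new;
	struct strbuf tmp = STRBUF_INIT;
	long new_len = ftell(fp);
	int ret = 1;	/* assume "differ" until proven equal */

	if (new_len < 0 || stat(path, &st) < 0)
		return 1;
	if (!S_ISREG(st.st_mode) || (off_t)new_len != st.st_size)
		return 1;

	rewind(fp);
	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
		goto out;	/* tmp may hold a partial read; release it below */
	the_hash_algo->init_fn(&c);
	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
	the_hash_algo->final_fn(oid_new.hash, &c);
	strbuf_reset(&tmp);

	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
		goto out;
	the_hash_algo->init_fn(&c);
	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
	the_hash_algo->final_fn(oid_old.hash, &c);

	ret = hashcmp(oid_old.hash, oid_new.hash);
out:
	strbuf_release(&tmp);	/* single exit point, so nothing leaks */
	return ret;
}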
Eric Wong <e@80x24.org> wrote:
> Combined with stdio and kernel buffering, the kernel should be able to avoid block layer writes and reduce wear.

Along the same lines, I'd like to get repack to stop recreating identical packs at some point (but that could be months from now...)
On Sat, May 11 2019, Eric Wong wrote:

> Do not change the existing info/refs and objects/info/packs files if they match the existing content on the filesystem. This is intended to preserve mtime and make it easier for dumb HTTP pollers to rely on the If-Modified-Since header.
>
> Combined with stdio and kernel buffering, the kernel should be able to avoid block layer writes and reduce wear.
>
> Signed-off-by: Eric Wong <e@80x24.org>
> ---
>  server-info.c                 | 61 +++++++++++++++++++++++++++++------
>  t/t5200-update-server-info.sh | 41 +++++++++++++++++++++++
>  2 files changed, 92 insertions(+), 10 deletions(-)
>  create mode 100755 t/t5200-update-server-info.sh
>
> diff --git a/server-info.c b/server-info.c
> index 41274d098b..34599e4817 100644
> --- a/server-info.c
> +++ b/server-info.c
> @@ -6,37 +6,78 @@
>  #include "tag.h"
>  #include "packfile.h"
>  #include "object-store.h"
> +#include "strbuf.h"
> +
> +static int files_differ(FILE *fp, const char *path)
> +{
> +	struct stat st;
> +	git_hash_ctx c;
> +	struct object_id oid_old, oid_new;
> +	struct strbuf tmp = STRBUF_INIT;
> +	long new_len = ftell(fp);
> +
> +	if (new_len < 0 || stat(path, &st) < 0)
> +		return 1;
> +	if (!S_ISREG(st.st_mode))
> +		return 1;
> +	if ((off_t)new_len != st.st_size)
> +		return 1;
> +
> +	rewind(fp);
> +	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
> +		return 1;
> +	the_hash_algo->init_fn(&c);
> +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> +	the_hash_algo->final_fn(oid_new.hash, &c);
> +	strbuf_release(&tmp);
> +
> +	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
> +		return 1;
> +	the_hash_algo->init_fn(&c);
> +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> +	the_hash_algo->final_fn(oid_old.hash, &c);
> +	strbuf_release(&tmp);
> +
> +	return hashcmp(oid_old.hash, oid_new.hash);
> +}

This way of doing it just seems so weirdly convoluted. Read them one at a time, compute the SHA-1, just to see if they're different. Why not something closer to a plain memcmp():

static int files_differ(FILE *fp, const char *path)
{
	struct strbuf old = STRBUF_INIT, new = STRBUF_INIT;
	long new_len = ftell(fp);
	int diff = 1;

	rewind(fp);
	if (strbuf_fread(&new, (size_t)new_len, fp) != (size_t)new_len)
		goto release_return;
	if (strbuf_read_file(&old, path, 0) < 0)
		goto release_return;

	diff = strbuf_cmp(&old, &new);

release_return:
	strbuf_release(&old);
	strbuf_release(&new);

	return diff;
}

I.e. optimize for code simplicity with something close to a dumb "cmp --silent". I'm going to make the bold claim that nobody using "dumb http" is going to be impacted by the performance of reading their for-each-ref or for-each-pack dump, hence opting for not even bothering to stat() to get the size before reading.

Because really, if we were *trying* to micro-optimize this for time or memory use there's much better ways, e.g. reading the old file and memcmp() as we go and stream the "generate" callback, but I just don't see the point of trying in this case.

> /*
>  * Create the file "path" by writing to a temporary file and renaming
>  * it into place. The contents of the file come from "generate", which
>  * should return non-zero if it encounters an error.
>  */
> -static int update_info_file(char *path, int (*generate)(FILE *))
> +static int update_info_file(char *path, int (*generate)(FILE *), int force)
> {
> 	char *tmp = mkpathdup("%s_XXXXXX", path);

Unrelated to this patch, but I hadn't thought about this nasty race condition. We recommend users run this from the "post-update" (or "post-receive") hook, and don't juggle the lock along with the ref update, thus due to the vagaries of scheduling you can end up with two concurrent ref updates where the "old" one wins.

But I guess that brings me back to something close to "nobody with that sort of update rate is using 'dumb http'" :)

For this to work properly we'd need to take some sort of global "ref update/pack update" lock, and I guess at that point this "cmp" case would be a helper similar to commit_lock_file_to(), i.e. a commit_lock_file_to_if_different().

> 	int ret = -1;
> 	int fd = -1;
> 	FILE *fp = NULL, *to_close;
> +	int do_update;
>
> 	safe_create_leading_directories(path);
> 	fd = git_mkstemp_mode(tmp, 0666);
> 	if (fd < 0)
> 		goto out;
> -	to_close = fp = fdopen(fd, "w");
> +	to_close = fp = fdopen(fd, "w+");
> 	if (!fp)
> 		goto out;
> 	fd = -1;
> 	ret = generate(fp);
> 	if (ret)
> 		goto out;
> +
> +	do_update = force || files_differ(fp, path);
> [...]
>
> -static int update_info_refs(void)
> +static int update_info_refs(int force)

So, I was going to say "shouldn't we update the docs?" which for --force say "Update the info files from scratch.".

But reading through it that "from scratch" wording is from c743e6e3c0 ("Add a link from update-server-info documentation to repository layout.", 2005-09-04). There it was a reference to a bug since fixed in 60d0526aaa ("Unoptimize info/refs creation.", 2005-09-14), and then removed from the docs in c5fe5b6de9 ("Remove obsolete bug warning in man git-update-server-info", 2009-04-25).

Then in b3223761c8 ("update_info_refs(): drop unused force parameter", 2019-04-05) Jeff removed the unused-for-a-decade "force" param.

So having gone through the history I think we're better off just dropping the --force argument entirely, or at least changing the docs.

Before this change the only thing it was doing was pruning stuff we haven't written since 2005-ish (see 3e15c67c90 ("server-info: throw away T computation as well.", 2005-12-04)); rather than "detect if useless" we should just write out the file again, and then skip if changed (i.e. this logic).

That would also take care of the parse_pack_def() case by proxy, i.e. we don't need some "is stale if pack is missing" special case if we just always write out a new file (or not if it memcmp's the same as the "old" one); we want to change the file if the packs are updated anyway.
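To make that last idea concrete: a hypothetical commit_lock_file_if_different() could wrap the existing lockfile calls like this. The name and the conditional-commit semantics are invented for illustration; only commit_lock_file(), rollback_lock_file(), get_lock_file_path(), and the strbuf helpers are real API:

static int commit_lock_file_if_different(struct lock_file *lk, const char *path)
{
	struct strbuf old = STRBUF_INIT, new = STRBUF_INIT;
	int same = 0;

	/* the staged content sits in the "*.lock" file until committed */
	if (strbuf_read_file(&new, get_lock_file_path(lk), 0) < 0) {
		rollback_lock_file(lk);
		return -1;
	}
	if (strbuf_read_file(&old, path, 0) >= 0)
		same = !strbuf_cmp(&old, &new);
	strbuf_release(&old);
	strbuf_release(&new);
	if (same) {
		rollback_lock_file(lk);	/* keep the old file and its mtime */
		return 0;
	}
	return commit_lock_file(lk);
}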
Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Sat, May 11 2019, Eric Wong wrote:
> > +static int files_differ(FILE *fp, const char *path)
> > +{
> > +	struct stat st;
> > +	git_hash_ctx c;
> > +	struct object_id oid_old, oid_new;
> > +	struct strbuf tmp = STRBUF_INIT;
> > +	long new_len = ftell(fp);
> > +
> > +	if (new_len < 0 || stat(path, &st) < 0)
> > +		return 1;
> > +	if (!S_ISREG(st.st_mode))
> > +		return 1;
> > +	if ((off_t)new_len != st.st_size)
> > +		return 1;
> > +
> > +	rewind(fp);
> > +	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_new.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_old.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	return hashcmp(oid_old.hash, oid_new.hash);
> > +}
>
> This way of doing it just seems so weirdly convoluted. Read them one at a time, compute the SHA-1, just to see if they're different. Why not something closer to a plain memcmp():
>
> static int files_differ(FILE *fp, const char *path)
> {
> 	struct strbuf old = STRBUF_INIT, new = STRBUF_INIT;
> 	long new_len = ftell(fp);
> 	int diff = 1;
>
> 	rewind(fp);
> 	if (strbuf_fread(&new, (size_t)new_len, fp) != (size_t)new_len)
> 		goto release_return;
> 	if (strbuf_read_file(&old, path, 0) < 0)
> 		goto release_return;
>
> 	diff = strbuf_cmp(&old, &new);
>
> release_return:
> 	strbuf_release(&old);
> 	strbuf_release(&new);
>
> 	return diff;
> }
>
> I.e. optimize for code simplicity with something close to a dumb "cmp --silent". I'm going to make the bold claim that nobody using "dumb http" is going to be impacted by the performance of reading their for-each-ref or for-each-pack dump, hence opting for not even bothering to stat() to get the size before reading.

I've been trying to improve dumb HTTP for more cases, actually (since it's much cheaper than smart HTTP in server memory/CPU).

> Because really, if we were *trying* to micro-optimize this for time or memory use there's much better ways, e.g. reading the old file and memcmp() as we go and stream the "generate" callback, but I just don't see the point of trying in this case.

I was actually going towards that route, but wasn't sure if this idea would be accepted at all (and I've been trying to stay away from using non-scripting languages). I don't like slurping all of info/refs into memory at all, so maybe a streaming memcmp of the existing file is worth doing...

> > /*
> >  * Create the file "path" by writing to a temporary file and renaming
> >  * it into place. The contents of the file come from "generate", which
> >  * should return non-zero if it encounters an error.
> >  */
> > -static int update_info_file(char *path, int (*generate)(FILE *))
> > +static int update_info_file(char *path, int (*generate)(FILE *), int force)
> > {
> > 	char *tmp = mkpathdup("%s_XXXXXX", path);
>
> Unrelated to this patch, but I hadn't thought about this nasty race condition. We recommend users run this from the "post-update" (or "post-receive") hook, and don't juggle the lock along with the ref update, thus due to the vagaries of scheduling you can end up with two concurrent ref updates where the "old" one wins.
>
> But I guess that brings me back to something close to "nobody with that sort of update rate is using 'dumb http'" :)
>
> For this to work properly we'd need to take some sort of global "ref update/pack update" lock, and I guess at that point this "cmp" case would be a helper similar to commit_lock_file_to(), i.e. a commit_lock_file_to_if_different().

Worth a separate patch, at some point, I think. I'm not too familiar with the existing locking in git, actually... Along those lines, I think repack/gc should automatically update objects/info/packs if the file already exists.

> > 	int ret = -1;
> > 	int fd = -1;
> > 	FILE *fp = NULL, *to_close;
> > +	int do_update;
> >
> > 	safe_create_leading_directories(path);
> > 	fd = git_mkstemp_mode(tmp, 0666);
> > 	if (fd < 0)
> > 		goto out;
> > -	to_close = fp = fdopen(fd, "w");
> > +	to_close = fp = fdopen(fd, "w+");
> > 	if (!fp)
> > 		goto out;
> > 	fd = -1;
> > 	ret = generate(fp);
> > 	if (ret)
> > 		goto out;
> > +
> > +	do_update = force || files_differ(fp, path);
> > [...]
> >
> > -static int update_info_refs(void)
> > +static int update_info_refs(int force)
>
> So, I was going to say "shouldn't we update the docs?" which for --force say "Update the info files from scratch.".
>
> But reading through it that "from scratch" wording is from c743e6e3c0 ("Add a link from update-server-info documentation to repository layout.", 2005-09-04).

Yes, that wording is awkward and I can update it. But maybe making it undocumented is sufficient and would save us the trouble of describing it :)

"--force" might be seen as a performance optimization for cases where you're certain the result will differ, but I'm not sure if that's worth mentioning in the manpage.

> There it was a reference to a bug since fixed in 60d0526aaa ("Unoptimize info/refs creation.", 2005-09-14), and then removed from the docs in c5fe5b6de9 ("Remove obsolete bug warning in man git-update-server-info", 2009-04-25).
>
> Then in b3223761c8 ("update_info_refs(): drop unused force parameter", 2019-04-05) Jeff removed the unused-for-a-decade "force" param.
>
> So having gone through the history I think we're better off just dropping the --force argument entirely, or at least changing the docs.

I can update the docs, or make it undocumented. Compatibility from the command-line needs to remain in case there are scripts using it.
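The streaming comparison Eric alludes to could look roughly like this; an illustrative sketch only, not code from the thread, assuming the same setup where fp is the freshly generated temporary file opened "w+" and path is the file being replaced:

static int files_differ(FILE *fp, const char *path)
{
	char buf_new[4096], buf_old[4096];
	FILE *old = fopen(path, "r");
	int diff = 1;

	if (!old)
		return 1;	/* no old file: it definitely differs */
	rewind(fp);
	for (;;) {
		size_t n_new = fread(buf_new, 1, sizeof(buf_new), fp);
		size_t n_old = fread(buf_old, 1, sizeof(buf_old), old);

		/* for regular files, fread only returns short at EOF/error,
		 * so equal-length files yield equal-sized chunks */
		if (n_new != n_old || memcmp(buf_new, buf_old, n_new))
			break;	/* length or content mismatch */
		if (!n_new) {	/* clean EOF on both streams */
			diff = ferror(fp) || ferror(old);
			break;
		}
	}
	fclose(old);
	return diff;
}

This compares one 4k block at a time (the block size Peff mentions in his reply below) and never holds more than two blocks in memory, at the cost of more code than the strbuf version.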
On Sun, May 12, 2019 at 01:37:55AM +0200, Ævar Arnfjörð Bjarmason wrote:

> This way of doing it just seems so weirdly convoluted. Read them one at a time, compute the SHA-1, just to see if they're different. Why not something closer to a plain memcmp():

FWIW, I had the exact same thought on reading the patch. Checking sizes seems like an easy optimization and I don't mind it, but computing hashes when we could just compare the bytes seems pointless. I'd have expected to stream 4k blocks as we go.

> I.e. optimize for code simplicity with something close to a dumb "cmp --silent". I'm going to make the bold claim that nobody using "dumb http" is going to be impacted by the performance of reading their for-each-ref or for-each-pack dump, hence opting for not even bothering to stat() to get the size before reading.

You're probably right (especially because we'd just spent O(n) work generating the candidate file). But note that I have seen pathological cases where info/refs was gigabytes.

> > /*
> >  * Create the file "path" by writing to a temporary file and renaming
> >  * it into place. The contents of the file come from "generate", which
> >  * should return non-zero if it encounters an error.
> >  */
> > -static int update_info_file(char *path, int (*generate)(FILE *))
> > +static int update_info_file(char *path, int (*generate)(FILE *), int force)
> > {
> > 	char *tmp = mkpathdup("%s_XXXXXX", path);
>
> Unrelated to this patch, but I hadn't thought about this nasty race condition. We recommend users run this from the "post-update" (or "post-receive") hook, and don't juggle the lock along with the ref update, thus due to the vagaries of scheduling you can end up with two concurrent ref updates where the "old" one wins.
>
> But I guess that brings me back to something close to "nobody with that sort of update rate is using 'dumb http'" :)
>
> For this to work properly we'd need to take some sort of global "ref update/pack update" lock, and I guess at that point this "cmp" case would be a helper similar to commit_lock_file_to(), i.e. a commit_lock_file_to_if_different().

I don't think our usual dot-locking is great here. What does the race-loser do when it can't take the lock? It can't just skip its update, since it needs to make sure that its new pack is mentioned. So it has to wait and poll until the other process finishes. I guess maybe that isn't the end of the world.

> Then in b3223761c8 ("update_info_refs(): drop unused force parameter", 2019-04-05) Jeff removed the unused-for-a-decade "force" param.
>
> So having gone through the history I think we're better off just dropping the --force argument entirely, or at least changing the docs.
>
> Before this change the only thing it was doing was pruning stuff we haven't written since 2005-ish (see 3e15c67c90 ("server-info: throw away T computation as well.", 2005-12-04)); rather than "detect if useless" we should just write out the file again, and then skip if changed (i.e. this logic).

Yeah, my commit only ripped out the useless force parameter for info/refs. For info/packs, there's still that weird "is it stale" computation (which I fixed several bugs in). It's not entirely clear to me if that can just go away, but I agree that if we can it's simpler and more desirable to just generate the candidate result and see if it's bit-for-bit identical or not.

I'm not entirely sure of all of the magic that "stale" check is trying to accomplish. I think there's some bits in there that try to preserve the existing ordering, but I don't know why anyone would care (and there are so many cases where the ordering gets thrown out that I think anybody who does care is likely to get disappointed).

-Peff
On Sun, May 12 2019, Jeff King wrote:

> On Sun, May 12, 2019 at 01:37:55AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> This way of doing it just seems so weirdly convoluted. Read them one at a time, compute the SHA-1, just to see if they're different. Why not something closer to a plain memcmp():
>
> FWIW, I had the exact same thought on reading the patch. Checking sizes seems like an easy optimization and I don't mind it, but computing hashes when we could just compare the bytes seems pointless. I'd have expected to stream 4k blocks as we go.
>
>> I.e. optimize for code simplicity with something close to a dumb "cmp --silent". I'm going to make the bold claim that nobody using "dumb http" is going to be impacted by the performance of reading their for-each-ref or for-each-pack dump, hence opting for not even bothering to stat() to get the size before reading.
>
> You're probably right (especially because we'd just spent O(n) work generating the candidate file). But note that I have seen pathological cases where info/refs was gigabytes.

Wouldn't those users be calling update-server-info from something like a post-receive hook where they *know* the refs/packs just got updated?

Well, there is "transfer.unpackLimit" to contend with, but that's just for "packs are same", not "refs are same", and that file's presumably much smaller.

So I'd think that *if* this is a case where there's even a question of the performance mattering here, and I'd assume "refs changed but size is the same" is a common case (e.g. craploads of tags, but just "master" pushed-to), we could make this an opt-in "--update-if-changed", because the "if changed" is extra work, and we're already documenting that you should be calling this on the "I know it changed" hook path.

I don't care even if we use the initial SHA-1 method. It just struck me as odd to conflate the reasonable "don't update mtime for optimizing dumb over-the-wire client" with "optimize resource use when updating".

>> > /*
>> >  * Create the file "path" by writing to a temporary file and renaming
>> >  * it into place. The contents of the file come from "generate", which
>> >  * should return non-zero if it encounters an error.
>> >  */
>> > -static int update_info_file(char *path, int (*generate)(FILE *))
>> > +static int update_info_file(char *path, int (*generate)(FILE *), int force)
>> > {
>> > 	char *tmp = mkpathdup("%s_XXXXXX", path);
>>
>> Unrelated to this patch, but I hadn't thought about this nasty race condition. We recommend users run this from the "post-update" (or "post-receive") hook, and don't juggle the lock along with the ref update, thus due to the vagaries of scheduling you can end up with two concurrent ref updates where the "old" one wins.
>>
>> But I guess that brings me back to something close to "nobody with that sort of update rate is using 'dumb http'" :)
>>
>> For this to work properly we'd need to take some sort of global "ref update/pack update" lock, and I guess at that point this "cmp" case would be a helper similar to commit_lock_file_to(), i.e. a commit_lock_file_to_if_different().
>
> I don't think our usual dot-locking is great here. What does the race-loser do when it can't take the lock? It can't just skip its update, since it needs to make sure that its new pack is mentioned.

Right, I mean a *global* .git/I_AM_DOING_WRITES.lock, because there's no way to square the ref backend's notion of per-ref ref lock enabling concurrent pushes with update-server-info's desire to generate metadata showing *all* the refs.

> So it has to wait and poll until the other process finishes. I guess maybe that isn't the end of the world.

If "its" is update-server-info this won't work. It's possible for two update-server-info processes to be racing in such a way that their for_each_ref() reads a ref at a given value that'll be updated 1 millisecond later, but to then have that update's update-server-info "win" the race to update the info files (hypothetical locks on those info files and all).

Thus the "info" files will be updated to old values, since we'll see we need changes, but change things to the old values.

All things that *can* be dealt with in some clever ways, but I think just further proof nobody's using this for anything serious :)

I.e. the "commit A happened before B" but also "commit B's post-* hooks finished after A" is a thing that happens quite frequently (per my logs).

>> Then in b3223761c8 ("update_info_refs(): drop unused force parameter", 2019-04-05) Jeff removed the unused-for-a-decade "force" param.
>>
>> So having gone through the history I think we're better off just dropping the --force argument entirely, or at least changing the docs.
>>
>> Before this change the only thing it was doing was pruning stuff we haven't written since 2005-ish (see 3e15c67c90 ("server-info: throw away T computation as well.", 2005-12-04)); rather than "detect if useless" we should just write out the file again, and then skip if changed (i.e. this logic).
>
> Yeah, my commit only ripped out the useless force parameter for info/refs. For info/packs, there's still that weird "is it stale" computation (which I fixed several bugs in). It's not entirely clear to me if that can just go away, but I agree that if we can it's simpler and more desirable to just generate the candidate result and see if it's bit-for-bit identical or not.
>
> I'm not entirely sure of all of the magic that "stale" check is trying to accomplish. I think there's some bits in there that try to preserve the existing ordering, but I don't know why anyone would care (and there are so many cases where the ordering gets thrown out that I think anybody who does care is likely to get disappointed).

My reading of it is that it's premature optimization that can go away (and most of it has already), for "it's cheap" and "if not it's called from the 'I know I had an update'" hook case, as noted above.
On Sun, May 12, 2019 at 09:16:31AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > You're probably right (especially because we'd just spent O(n) work generating the candidate file). But note that I have seen pathological cases where info/refs was gigabytes.
>
> Wouldn't those users be calling update-server-info from something like a post-receive hook where they *know* the refs/packs just got updated?
>
> Well, there is "transfer.unpackLimit" to contend with, but that's just for "packs are same", not "refs are same", and that file's presumably much smaller.

Yeah, I think there's sort of an open question here of who is calling update-server-info when nothing got updated. I think the only place we call it automatically is via receive-pack, but I'd guess Eric runs it as part of public-inbox scripts.

I agree that this is probably mostly about info/packs. Not every push (or repo update) will create one, but every push _should_ be changing a ref (if it succeeds at all). And I'd guess Eric's interest comes from the use of Git in public-inbox, which is going to make frequent but small updates.

So this does seem like a really niche case; it avoids one single HTTP request of a file that should generally be small (unless you're letting your pack list grow really big, but I think there are other issues with that) in a case that we know will generate a bunch of other HTTP traffic (if somebody updated the server info, there was indeed a push, so you'd get a refreshed info/refs and presumably the new loose objects).

That said, the logic to preserve the mtime is pretty straightforward, so I don't mind it too much if Eric finds this really improves things for him.

> > I don't think our usual dot-locking is great here. What does the race-loser do when it can't take the lock? It can't just skip its update, since it needs to make sure that its new pack is mentioned.
>
> Right, I mean a *global* .git/I_AM_DOING_WRITES.lock, because there's no way to square the ref backend's notion of per-ref ref lock enabling concurrent pushes with update-server-info's desire to generate metadata showing *all* the refs.

OK. I agree that would work, but it's nasty to delay user-initiated operations for ongoing maintenance (another obvious place to want such a lock is for pruning, which can take quite a while).

> > So it has to wait and poll until the other process finishes. I guess maybe that isn't the end of the world.
>
> If "its" is update-server-info this won't work. It's possible for two update-server-info processes to be racing in such a way that their for_each_ref() reads a ref at a given value that'll be updated 1 millisecond later, but to then have that update's update-server-info "win" the race to update the info files (hypothetical locks on those info files and all).
>
> Thus the "info" files will be updated to old values, since we'll see we need changes, but change things to the old values.
>
> All things that *can* be dealt with in some clever ways, but I think just further proof nobody's using this for anything serious :)
>
> I.e. the "commit A happened before B" but also "commit B's post-* hooks finished after A" is a thing that happens quite frequently (per my logs).

I think it would work because any update-server-info, whether from A or B, will take into account the full current repo state (and we don't look at that state until we take the lock). So you might get an interleaved "A-push, B-push, B-maint, A-maint", but that's OK. A-maint will represent B's state when it runs.

> > I'm not entirely sure of all of the magic that "stale" check is trying to accomplish. I think there's some bits in there that try to preserve the existing ordering, but I don't know why anyone would care (and there are so many cases where the ordering gets thrown out that I think anybody who does care is likely to get disappointed).
>
> My reading of it is that it's premature optimization that can go away (and most of it has already), for "it's cheap" and "if not it's called from the 'I know I had an update'" hook case, as noted above.

That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)

-Peff
On Tue, May 14 2019, Jeff King wrote:

> On Sun, May 12, 2019 at 09:16:31AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > You're probably right (especially because we'd just spent O(n) work generating the candidate file). But note that I have seen pathological cases where info/refs was gigabytes.
>>
>> Wouldn't those users be calling update-server-info from something like a post-receive hook where they *know* the refs/packs just got updated?
>>
>> Well, there is "transfer.unpackLimit" to contend with, but that's just for "packs are same", not "refs are same", and that file's presumably much smaller.
>
> Yeah, I think there's sort of an open question here of who is calling update-server-info when nothing got updated. I think the only place we call it automatically is via receive-pack, but I'd guess Eric runs it as part of public-inbox scripts.
>
> I agree that this is probably mostly about info/packs. Not every push (or repo update) will create one, but every push _should_ be changing a ref (if it succeeds at all). And I'd guess Eric's interest comes from the use of Git in public-inbox, which is going to make frequent but small updates.
>
> So this does seem like a really niche case; it avoids one single HTTP request of a file that should generally be small (unless you're letting your pack list grow really big, but I think there are other issues with that) in a case that we know will generate a bunch of other HTTP traffic (if somebody updated the server info, there was indeed a push, so you'd get a refreshed info/refs and presumably the new loose objects).
>
> That said, the logic to preserve the mtime is pretty straightforward, so I don't mind it too much if Eric finds this really improves things for him.
>
>> > I don't think our usual dot-locking is great here. What does the race-loser do when it can't take the lock? It can't just skip its update, since it needs to make sure that its new pack is mentioned.
>>
>> Right, I mean a *global* .git/I_AM_DOING_WRITES.lock, because there's no way to square the ref backend's notion of per-ref ref lock enabling concurrent pushes with update-server-info's desire to generate metadata showing *all* the refs.
>
> OK. I agree that would work, but it's nasty to delay user-initiated operations for ongoing maintenance (another obvious place to want such a lock is for pruning, which can take quite a while).
>
>> > So it has to wait and poll until the other process finishes. I guess maybe that isn't the end of the world.
>>
>> If "its" is update-server-info this won't work. It's possible for two update-server-info processes to be racing in such a way that their for_each_ref() reads a ref at a given value that'll be updated 1 millisecond later, but to then have that update's update-server-info "win" the race to update the info files (hypothetical locks on those info files and all).
>>
>> Thus the "info" files will be updated to old values, since we'll see we need changes, but change things to the old values.
>>
>> All things that *can* be dealt with in some clever ways, but I think just further proof nobody's using this for anything serious :)
>>
>> I.e. the "commit A happened before B" but also "commit B's post-* hooks finished after A" is a thing that happens quite frequently (per my logs).
>
> I think it would work because any update-server-info, whether from A or B, will take into account the full current repo state (and we don't look at that state until we take the lock). So you might get an interleaved "A-push, B-push, B-maint, A-maint", but that's OK. A-maint will represent B's state when it runs.

Maybe we're talking about different things. I mean the following sequence:

1. Refs "X" and "Y" are at X=A Y=A
2. Concurrent push #1 happens, updating X from A..F
3. Concurrent push #2 happens, updating Y from A..F
4. Concurrent push #1 succeeds
5. Concurrent push #1 starts update-server-info. Reads X=F Y=A
6. Concurrent push #2 succeeds
7. Concurrent push #2 starts update-server-info. Reads X=F Y=F
8. Concurrent push #2's update-server-info finishes, X=F Y=F written to "info"
9. Concurrent push #1's update-server-info finishes, X=A Y=F written to "info"

I.e. because we have per-ref locks and no lock at all on update-server-info (but that would need to be a global ref lock, not just on the "info" files) we can have a push that's already read "X"'s value as "A" while updating "Y" win the race against an update-server-info that updated "X"'s value to "F".

It will get fixed on the next push (at least as far as "X"'s value goes), but until that time dumb clients will falsely see that "X" hasn't been updated.

>> > I'm not entirely sure of all of the magic that "stale" check is trying to accomplish. I think there's some bits in there that try to preserve the existing ordering, but I don't know why anyone would care (and there are so many cases where the ordering gets thrown out that I think anybody who does care is likely to get disappointed).
>>
>> My reading of it is that it's premature optimization that can go away (and most of it has already), for "it's cheap" and "if not it's called from the 'I know I had an update'" hook case, as noted above.
>
> That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)

Yeah I agree with that sentiment. I don't use it. I was just wondering if for those who still want it this wasn't a relatively obscure use-case, so it should perhaps be hidden behind an option if there were optimization concerns.
On Tue, May 14, 2019 at 12:33:11PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > I think it would work because any update-server-info, whether from A or B, will take into account the full current repo state (and we don't look at that state until we take the lock). So you might get an interleaved "A-push, B-push, B-maint, A-maint", but that's OK. A-maint will represent B's state when it runs.
>
> Maybe we're talking about different things. I mean the following sequence:
>
> 1. Refs "X" and "Y" are at X=A Y=A
> 2. Concurrent push #1 happens, updating X from A..F
> 3. Concurrent push #2 happens, updating Y from A..F
> 4. Concurrent push #1 succeeds
> 5. Concurrent push #1 starts update-server-info. Reads X=F Y=A
> 6. Concurrent push #2 succeeds
> 7. Concurrent push #2 starts update-server-info. Reads X=F Y=F
> 8. Concurrent push #2's update-server-info finishes, X=F Y=F written to "info"
> 9. Concurrent push #1's update-server-info finishes, X=A Y=F written to "info"
>
> I.e. because we have per-ref locks and no lock at all on update-server-info (but that would need to be a global ref lock, not just on the "info" files) we can have a push that's already read "X"'s value as "A" while updating "Y" win the race against an update-server-info that updated "X"'s value to "F".
>
> It will get fixed on the next push (at least as far as "X"'s value goes), but until that time dumb clients will falsely see that "X" hasn't been updated.

That's the same situation. But I thought we were talking about having an update-server-info lock. In which case the #2 update-server-info or the #1 update-server-info runs in its entirety, and cannot have their read and write steps interleaved (that's what I meant by "don't look at the state until we take the lock"). Then that gives us a strict ordering: we know that _some_ update-server-info (be it #1 or #2's) will run after any given update.

-Peff
Jeff King <peff@peff.net> wrote:
> Yeah, I think there's sort of an open question here of who is calling update-server-info when nothing got updated. I think the only place we call it automatically is via receive-pack, but I'd guess Eric runs it as part of public-inbox scripts.

Correct. post-update doesn't get run because public-inbox relies on fast-import. I have mirrors using "git fetch", which also doesn't call post-update, so I was calling update-server-info in my mirror script.

Since more people have taken an interest in mirroring things, I figured I'd make "public-inbox-index" (the script which updates the Xapian and SQLite indices) call update-server-info itself.

That way, it's simpler to mirror (v1) inboxes:

	git fetch && git update-server-info && public-inbox-index

becomes:

	git fetch && public-inbox-index

That's a huge savings in cognitive overhead.

So, my eventual goal for this is we get to the point where any git operation which changes refs will automatically update info/refs if it exists. Ditto for objects/info/packs on any pack changes.

This, like my bitmap-by-default change, is among the things I'm doing to make it easier for more users to start self-hosting (including mirroring) any type of git repo.

Anyways, I am far from knowledgeable about the locking discussion for git, though :x

> That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)

I also have something else on my mind for abusing info files with :> (another email)
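For context, the "post-update" hook path discussed throughout the thread is trivial; the sample hook git ships in its templates directory is essentially just:

	#!/bin/sh
	exec git update-server-info

which is why receive-pack is the one place this happens "automatically", and why fetch- and fast-import-based workflows like Eric's have to call it by hand.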
On Tue, May 14 2019, Jeff King wrote:

> On Tue, May 14, 2019 at 12:33:11PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > I think it would work because any update-server-info, whether from A or B, will take into account the full current repo state (and we don't look at that state until we take the lock). So you might get an interleaved "A-push, B-push, B-maint, A-maint", but that's OK. A-maint will represent B's state when it runs.
>>
>> Maybe we're talking about different things. I mean the following sequence:
>>
>> 1. Refs "X" and "Y" are at X=A Y=A
>> 2. Concurrent push #1 happens, updating X from A..F
>> 3. Concurrent push #2 happens, updating Y from A..F
>> 4. Concurrent push #1 succeeds
>> 5. Concurrent push #1 starts update-server-info. Reads X=F Y=A
>> 6. Concurrent push #2 succeeds
>> 7. Concurrent push #2 starts update-server-info. Reads X=F Y=F
>> 8. Concurrent push #2's update-server-info finishes, X=F Y=F written to "info"
>> 9. Concurrent push #1's update-server-info finishes, X=A Y=F written to "info"
>>
>> I.e. because we have per-ref locks and no lock at all on update-server-info (but that would need to be a global ref lock, not just on the "info" files) we can have a push that's already read "X"'s value as "A" while updating "Y" win the race against an update-server-info that updated "X"'s value to "F".
>>
>> It will get fixed on the next push (at least as far as "X"'s value goes), but until that time dumb clients will falsely see that "X" hasn't been updated.
>
> That's the same situation. But I thought we were talking about having an update-server-info lock. In which case the #2 update-server-info or the #1 update-server-info runs in its entirety, and cannot have their read and write steps interleaved (that's what I meant by "don't look at the state until we take the lock"). Then that gives us a strict ordering: we know that _some_ update-server-info (be it #1 or #2's) will run after any given update.

Yeah you're right. I *thought* in my last E-mail we were talking about the current state, but re-reading upthread I see that was a fail on my part. An update-server-info lock would solve this indeed.

We could still end up with a situation where a naïve version of the lockfile API would fail for the "new" update since the old one was underway, so we'd need something similar to core.*Ref*Timeout, but if we ran into a *.lock or the timeout we could exit non-zero, as opposed to silently failing like it does now when it races.
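Concretely, such a lock could reuse git's existing hold_lock_file_for_update_timeout(); everything else in this sketch (the wrapper, the lock path, the 1-second timeout) is invented for illustration:

static int update_server_info_locked(int force)
{
	static struct lock_file lk = LOCK_INIT;
	int ret;

	/* hypothetical global lock: creates .git/info/server-info.lock,
	 * polling up to 1000ms before giving up with a non-zero exit */
	if (hold_lock_file_for_update_timeout(&lk, git_path("info/server-info"),
					      0, 1000) < 0)
		return error_errno("could not lock server info files");
	ret = update_server_info(force);
	rollback_lock_file(&lk);	/* the lock only serializes; nothing to commit */
	return ret;
}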
Eric Wong <e@80x24.org> wrote:
> Jeff King <peff@peff.net> wrote:
> > That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)
>
> I also have something else on my mind for abusing info files with :> (another email)

I'm not sure when/if I'll have time for this; but this ought to be possible:

	GIT_DIR=$HTTP_URL git <any read-only command>

And possible without requiring admins to set up or change anything on their server.

Right now, I could do it by setting up a WebDAV server and using fusedav [1] on the client. But, not everybody runs a WebDAV server which allows PROPFIND for listing files... However, info/refs and objects/info/packs can give us all the info we need without needing PROPFIND. All we'd need is the common GET/HEAD HTTP methods for read-only access.

git doesn't need mmap, and curl + Range requests ought to be able to get us what we need to emulate pread. It'd be great for low-latency LANs; maybe not so great with high latency, but probably better in many cases than cloning a giant repo to cat one blob.

Also, cloning from a static bundle ought to be doable with:

	git clone $REMOTE_OR_LOCAL_PATH/foo.bundle

And yeah, it also sucks that bundles double storage overhead for admins; it would be nice if I could use bundles as alternates or packs...

Anyways, all of this is probably a lot of work and I don't hack much, anymore.

[1] I have many patches for fusedav, and the debian maintainer seems dead, and upstream's moved on: https://bugs.debian.org/fusedav
    davfs2 can't do Range requests, so it won't work for big repos...
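To make the pread emulation concrete: any static file server that honors Range requests can serve byte-level reads of a pack, e.g. (the pack name below is a placeholder):

	# the dumb-HTTP discovery files a client would start from:
	curl -sf "$HTTP_URL/info/refs"
	curl -sf "$HTTP_URL/objects/info/packs"

	# pread(fd, buf, 4096, 0) on a pack index becomes a Range request:
	curl -sf -r 0-4095 "$HTTP_URL/objects/pack/pack-1234abcd.idx" -o idx-head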
On Tue, May 14 2019, Eric Wong wrote:

> Jeff King <peff@peff.net> wrote:
>> Yeah, I think there's sort of an open question here of who is calling update-server-info when nothing got updated. I think the only place we call it automatically is via receive-pack, but I'd guess Eric runs it as part of public-inbox scripts.
>
> Correct. post-update doesn't get run because public-inbox relies on fast-import. I have mirrors using "git fetch", which also doesn't call post-update, so I was calling update-server-info in my mirror script.
>
> Since more people have taken an interest in mirroring things, I figured I'd make "public-inbox-index" (the script which updates the Xapian and SQLite indices) call update-server-info itself.
>
> That way, it's simpler to mirror (v1) inboxes:
>
> 	git fetch && git update-server-info && public-inbox-index
>
> becomes:
>
> 	git fetch && public-inbox-index
>
> That's a huge savings in cognitive overhead.
>
> So, my eventual goal for this is we get to the point where any git operation which changes refs will automatically update info/refs if it exists. Ditto for objects/info/packs on any pack changes.
>
> This, like my bitmap-by-default change, is among the things I'm doing to make it easier for more users to start self-hosting (including mirroring) any type of git repo.
>
> Anyways, I am far from knowledgeable about the locking discussion for git, though :x
>
>> That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)
>
> I also have something else on my mind for abusing info files with :> (another email)

Aside from this change, I wonder if making "fetch" optionally "exit 1" if no refs were updated would be useful, as in the below WIP. Of course it would be better to distinguish errors from "no refs to update". I've hacked around it not doing that in the past for similar "fetch and index" things by looking at for-each-ref before/after, or only seeing if HEAD changed (so this wouldn't be a general solution...):

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 4ba63d5ac6..da5414d9db 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -47,6 +47,7 @@ static int prune_tags = -1; /* unspecified */
 #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
 
 static int all, append, dry_run, force, keep, multiple, update_head_ok, verbosity, deepen_relative;
+static int exit_code;
 static int progress = -1;
 static int tags = TAGS_DEFAULT, unshallow, update_shallow, deepen;
 static int max_children = 1;
@@ -66,6 +67,7 @@ static struct refspec refmap = REFSPEC_INIT_FETCH;
 static struct list_objects_filter_options filter_options;
 static struct string_list server_options = STRING_LIST_INIT_DUP;
 static struct string_list negotiation_tip = STRING_LIST_INIT_NODUP;
+static int updated_refs;
 
 static int git_fetch_config(const char *k, const char *v, void *cb)
 {
@@ -133,6 +135,8 @@ static struct option builtin_fetch_options[] = {
 	{ OPTION_CALLBACK, 0, "recurse-submodules", &recurse_submodules, N_("on-demand"),
 	  N_("control recursive fetching of submodules"),
 	  PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules },
+	OPT_BOOL(0, "exit-code", &exit_code,
+		 N_("exit successfully if refs are updated")),
 	OPT_BOOL(0, "dry-run", &dry_run,
 		 N_("dry run")),
 	OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
@@ -522,8 +526,10 @@ static int s_update_ref(const char *action,
 	struct strbuf err = STRBUF_INIT;
 	int ret, df_conflict = 0;
 
-	if (dry_run)
+	if (dry_run) {
+		updated_refs++;
 		return 0;
+	}
 	if (!rla)
 		rla = default_rla.buf;
 	msg = xstrfmt("%s: %s", rla, action);
@@ -545,6 +551,7 @@ static int s_update_ref(const char *action,
 	ref_transaction_free(transaction);
 	strbuf_release(&err);
 	free(msg);
+	updated_refs++;
 	return 0;
 fail:
 	ref_transaction_free(transaction);
@@ -1680,5 +1687,9 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
 	argv_array_clear(&argv_gc_auto);
 
-	return result;
+	if (result)
+		return result;
+	if (exit_code)
+		return !updated_refs;
+	return 0;
 }
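If something along these lines were merged, Eric's mirror pipeline from earlier in the thread could then short-circuit the reindex when nothing changed (the option name is taken from the WIP patch above, so it is hypothetical until the patch lands):

	git fetch --exit-code && public-inbox-index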
On Tue, May 14, 2019 at 12:13:50PM +0000, Eric Wong wrote:

> I'm not sure when/if I'll have time for this; but this ought to be possible:
>
> 	GIT_DIR=$HTTP_URL git <any read-only command>
>
> And possible without requiring admins to set up or change anything on their server.
> [...]
> git doesn't need mmap, and curl + Range requests ought to be able to get us what we need to emulate pread. It'd be great for low-latency LANs; maybe not so great with high latency, but probably better in many cases than cloning a giant repo to cat one blob.

My first thought here is that we could probably just use the filesystem boundary as the appropriate layer, and have an httpfs fuse driver. Your mention of fusedav makes me think you've already gone this route.

I guess non-dav http doesn't necessarily let us enumerate directories. But it might be enough if we could insert ourselves in a few key spots. E.g., when we try to opendir("objects/packs") and it fails with ENOSYS or similar, then we fall back to opening "objects/info/packs" to get the same information (and ditto for loose ref traversal). There'd still be _some_ work in Git, but it wouldn't require replacing every single read with HTTP magic.

> Also, cloning from a static bundle ought to be doable with:
>
> 	git clone $REMOTE_OR_LOCAL_PATH/foo.bundle

Some nearby discussion: https://public-inbox.org/git/20190514092900.GA11679@sigill.intra.peff.net/

> And yeah, it also sucks that bundles double storage overhead for admins; it would be nice if I could use bundles as alternates or packs...

Yes. It's nice that they're a single file for some uses, but in terms of implementation it would be easier if the bundle files just said "here are some refs, and you can find my packfile in the relative filename XYZ.pack". And then you'd just store the two of them together (and a matching .idx if you wanted to use it as a real pack, but the bundle readers wouldn't care).

It probably wouldn't be that hard to wedge that into the bundle format by bumping the version field.

-Peff
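For contrast, here is what already works with bundles today: they are a valid clone source as a local path, just not straight off a dumb HTTP URL (the file and directory names below are arbitrary):

	# server side: pack all refs plus their objects into one flat file
	git bundle create repo.bundle --all
	# client side: a bundle path can be cloned from directly
	git clone repo.bundle my-repo

The storage-doubling complaint follows from this layout: the bundle embeds its own packfile, so a hosting admin ends up keeping the same pack data once inside the bundle and once in objects/pack/.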
On Tue, May 14, 2019 at 02:19:36PM +0200, Ævar Arnfjörð Bjarmason wrote:

> Aside from this change, I wonder if making "fetch" optionally "exit 1" if no refs were updated would be useful, as in the below WIP. Of course it would be better to distinguish errors from "no refs to update".

That seems very sensible to me, as long as it's an optional flag as you have it here.

> @@ -133,6 +135,8 @@ static struct option builtin_fetch_options[] = {
> 	{ OPTION_CALLBACK, 0, "recurse-submodules", &recurse_submodules, N_("on-demand"),
> 	  N_("control recursive fetching of submodules"),
> 	  PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules },
> +	OPT_BOOL(0, "exit-code", &exit_code,
> +		 N_("exit successfully if refs are updated")),

I think it makes sense to flip this explanation: "exit non-zero if no refs are updated" or "treat it as an error if no refs are updated". Since the default behavior is already to exit successfully in this case. :)

We may also want to advertise a particular result code so callers can distinguish it from regular die() errors.

-Peff
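Applying both of Peff's suggestions to the WIP hunks would amount to something like this (the specific exit code 2 is an invented example; the only constraint is that it differ from die()'s 128 and the generic 1):

	OPT_BOOL(0, "exit-code", &exit_code,
		 N_("exit non-zero if no refs are updated")),

and in cmd_fetch():

	if (exit_code && !updated_refs)
		return 2;	/* advertised "nothing updated" code */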
diff --git a/server-info.c b/server-info.c
index 41274d098b..34599e4817 100644
--- a/server-info.c
+++ b/server-info.c
@@ -6,37 +6,78 @@
 #include "tag.h"
 #include "packfile.h"
 #include "object-store.h"
+#include "strbuf.h"
+
+static int files_differ(FILE *fp, const char *path)
+{
+	struct stat st;
+	git_hash_ctx c;
+	struct object_id oid_old, oid_new;
+	struct strbuf tmp = STRBUF_INIT;
+	long new_len = ftell(fp);
+
+	if (new_len < 0 || stat(path, &st) < 0)
+		return 1;
+	if (!S_ISREG(st.st_mode))
+		return 1;
+	if ((off_t)new_len != st.st_size)
+		return 1;
+
+	rewind(fp);
+	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
+		return 1;
+	the_hash_algo->init_fn(&c);
+	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
+	the_hash_algo->final_fn(oid_new.hash, &c);
+	strbuf_release(&tmp);
+
+	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
+		return 1;
+	the_hash_algo->init_fn(&c);
+	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
+	the_hash_algo->final_fn(oid_old.hash, &c);
+	strbuf_release(&tmp);
+
+	return hashcmp(oid_old.hash, oid_new.hash);
+}
 
 /*
  * Create the file "path" by writing to a temporary file and renaming
  * it into place. The contents of the file come from "generate", which
  * should return non-zero if it encounters an error.
  */
-static int update_info_file(char *path, int (*generate)(FILE *))
+static int update_info_file(char *path, int (*generate)(FILE *), int force)
 {
 	char *tmp = mkpathdup("%s_XXXXXX", path);
 	int ret = -1;
 	int fd = -1;
 	FILE *fp = NULL, *to_close;
+	int do_update;
 
 	safe_create_leading_directories(path);
 	fd = git_mkstemp_mode(tmp, 0666);
 	if (fd < 0)
 		goto out;
-	to_close = fp = fdopen(fd, "w");
+	to_close = fp = fdopen(fd, "w+");
 	if (!fp)
 		goto out;
 	fd = -1;
 	ret = generate(fp);
 	if (ret)
 		goto out;
+
+	do_update = force || files_differ(fp, path);
 	fp = NULL;
 	if (fclose(to_close))
 		goto out;
-	if (adjust_shared_perm(tmp) < 0)
-		goto out;
-	if (rename(tmp, path) < 0)
-		goto out;
+	if (do_update) {
+		if (adjust_shared_perm(tmp) < 0)
+			goto out;
+		if (rename(tmp, path) < 0)
+			goto out;
+	} else {
+		unlink(tmp);
+	}
 	ret = 0;
 
 out:
@@ -78,10 +119,10 @@ static int generate_info_refs(FILE *fp)
 	return for_each_ref(add_info_ref, fp);
 }
 
-static int update_info_refs(void)
+static int update_info_refs(int force)
 {
 	char *path = git_pathdup("info/refs");
-	int ret = update_info_file(path, generate_info_refs);
+	int ret = update_info_file(path, generate_info_refs, force);
 	free(path);
 	return ret;
 }
@@ -254,7 +295,7 @@ static int update_info_packs(int force)
 	int ret;
 
 	init_pack_info(infofile, force);
-	ret = update_info_file(infofile, write_pack_info_file);
+	ret = update_info_file(infofile, write_pack_info_file, force);
 	free_pack_info();
 	free(infofile);
 	return ret;
@@ -269,7 +310,7 @@ int update_server_info(int force)
 	 */
 	int errs = 0;
 
-	errs = errs | update_info_refs();
+	errs = errs | update_info_refs(force);
 	errs = errs | update_info_packs(force);
 
 	/* remove leftover rev-cache file if there is any */
diff --git a/t/t5200-update-server-info.sh b/t/t5200-update-server-info.sh
new file mode 100755
index 0000000000..f1cec46914
--- /dev/null
+++ b/t/t5200-update-server-info.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+
+test_description='Test git stash show configuration.'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' 'test_commit file'
+
+test_expect_success 'create info/refs' '
+	git update-server-info &&
+	test_path_is_file .git/info/refs
+'
+
+test_expect_success 'modify and store mtime' '
+	test-tool chmtime =0 .git/info/refs &&
+	test-tool chmtime --get .git/info/refs >a
+'
+
+test_expect_success 'info/refs is not needlessly overwritten' '
+	git update-server-info &&
+	test-tool chmtime --get .git/info/refs >b &&
+	test_cmp a b
+'
+
+test_expect_success 'info/refs can be forced to update' '
+	git update-server-info -f &&
+	test-tool chmtime --get .git/info/refs >b &&
+	! test_cmp a b
+'
+
+test_expect_success 'info/refs updates when changes are made' '
+	test-tool chmtime =0 .git/info/refs &&
+	test-tool chmtime --get .git/info/refs >b &&
+	test_cmp a b &&
+	git update-ref refs/heads/foo HEAD &&
+	git update-server-info &&
+	test-tool chmtime --get .git/info/refs >b &&
+	! test_cmp a b
'

+test_done
Do not change the existing info/refs and objects/info/packs files if they match the existing content on the filesystem. This is intended to preserve mtime and make it easier for dumb HTTP pollers to rely on the If-Modified-Since header.

Combined with stdio and kernel buffering, the kernel should be able to avoid block layer writes and reduce wear.

Signed-off-by: Eric Wong <e@80x24.org>
---
 server-info.c                 | 61 +++++++++++++++++++++++++++++------
 t/t5200-update-server-info.sh | 41 +++++++++++++++++++++++
 2 files changed, 92 insertions(+), 10 deletions(-)
 create mode 100755 t/t5200-update-server-info.sh