Message ID | 20190511013455.5886-1-e@80x24.org (mailing list archive)
---|---
State | New, archived
Series | update-server-info: avoid needless overwrites
On Fri, May 10, 2019 at 9:35 PM Eric Wong <e@80x24.org> wrote:
> Do not change the existing info/refs and objects/info/packs files if they match the existing content on the filesystem. This is intended to preserve mtime and make it easier for dumb HTTP pollers to rely on the If-Modified-Since header.
> [...]
> Signed-off-by: Eric Wong <e@80x24.org>
> ---
> diff --git a/server-info.c b/server-info.c
> @@ -6,37 +6,78 @@
> +static int files_differ(FILE *fp, const char *path)
> +{
> +	[...]
> +	struct strbuf tmp = STRBUF_INIT;
> +	[...]
> +	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
> +		return 1;

Although strbuf_fread() will release 'tmp' automatically if the read actually fails, it won't do so otherwise. So, if it reads more or fewer bytes than expected, for some reason, then this early return will leak the strbuf, won't it?

> +	the_hash_algo->init_fn(&c);
> +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> +	the_hash_algo->final_fn(oid_new.hash, &c);
> +	strbuf_release(&tmp);
> +
> +	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
> +		return 1;

Same issue.

> diff --git a/t/t5200-update-server-info.sh b/t/t5200-update-server-info.sh
> @@ -0,0 +1,41 @@
> +test_description='Test git stash show configuration.'

Copy/paste error for test description?
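To illustrate the point, a leak-free restructuring of those early returns might look like this. This is only a sketch, not code from the thread; it assumes the same declarations and helpers server-info.c already uses in the patch:

static int files_differ(FILE *fp, const char *path)
{
	struct stat st;
	git_hash_ctx c;
	struct object_id oid_old, oid_new;
	struct strbuf tmp = STRBUF_INIT;
	long new_len = ftell(fp);
	int ret = 1;	/* assume "differ" until proven equal */

	if (new_len < 0 || stat(path, &st) < 0)
		return 1;
	if (!S_ISREG(st.st_mode) || (off_t)new_len != st.st_size)
		return 1;

	rewind(fp);
	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
		goto out;	/* tmp may hold a partial read; release it below */
	the_hash_algo->init_fn(&c);
	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
	the_hash_algo->final_fn(oid_new.hash, &c);
	strbuf_reset(&tmp);

	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
		goto out;
	the_hash_algo->init_fn(&c);
	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
	the_hash_algo->final_fn(oid_old.hash, &c);

	ret = hashcmp(oid_old.hash, oid_new.hash);
out:
	strbuf_release(&tmp);	/* single exit point, so nothing leaks */
	return ret;
}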
Eric Wong <e@80x24.org> wrote:
> Combined with stdio and kernel buffering, the kernel should be able to avoid block layer writes and reduce wear.

Along the same lines, I'd like to get repack to stop recreating identical packs at some point (but that could be months from now...)
On Sat, May 11 2019, Eric Wong wrote:

> Do not change the existing info/refs and objects/info/packs files if they match the existing content on the filesystem. This is intended to preserve mtime and make it easier for dumb HTTP pollers to rely on the If-Modified-Since header.
>
> Combined with stdio and kernel buffering, the kernel should be able to avoid block layer writes and reduce wear.
>
> Signed-off-by: Eric Wong <e@80x24.org>
> ---
>  server-info.c                 | 61 +++++++++++++++++++++++++++++------
>  t/t5200-update-server-info.sh | 41 +++++++++++++++++++++++
>  2 files changed, 92 insertions(+), 10 deletions(-)
>  create mode 100755 t/t5200-update-server-info.sh
>
> diff --git a/server-info.c b/server-info.c
> index 41274d098b..34599e4817 100644
> --- a/server-info.c
> +++ b/server-info.c
> @@ -6,37 +6,78 @@
>  #include "tag.h"
>  #include "packfile.h"
>  #include "object-store.h"
> +#include "strbuf.h"
> +
> +static int files_differ(FILE *fp, const char *path)
> +{
> +	struct stat st;
> +	git_hash_ctx c;
> +	struct object_id oid_old, oid_new;
> +	struct strbuf tmp = STRBUF_INIT;
> +	long new_len = ftell(fp);
> +
> +	if (new_len < 0 || stat(path, &st) < 0)
> +		return 1;
> +	if (!S_ISREG(st.st_mode))
> +		return 1;
> +	if ((off_t)new_len != st.st_size)
> +		return 1;
> +
> +	rewind(fp);
> +	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
> +		return 1;
> +	the_hash_algo->init_fn(&c);
> +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> +	the_hash_algo->final_fn(oid_new.hash, &c);
> +	strbuf_release(&tmp);
> +
> +	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
> +		return 1;
> +	the_hash_algo->init_fn(&c);
> +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> +	the_hash_algo->final_fn(oid_old.hash, &c);
> +	strbuf_release(&tmp);
> +
> +	return hashcmp(oid_old.hash, oid_new.hash);
> +}

This way of doing it just seems so weirdly convoluted. Read them one at a time, compute the SHA-1, just to see if they're different. Why not something closer to a plain memcmp():

static int files_differ(FILE *fp, const char *path)
{
	struct strbuf old = STRBUF_INIT, new = STRBUF_INIT;
	long new_len = ftell(fp);
	int diff = 1;

	rewind(fp);
	if (strbuf_fread(&new, (size_t)new_len, fp) != (size_t)new_len)
		goto release_return;
	if (strbuf_read_file(&old, path, 0) < 0)
		goto release_return;

	diff = strbuf_cmp(&old, &new);

release_return:
	strbuf_release(&old);
	strbuf_release(&new);

	return diff;
}

I.e. optimize for code simplicity with something close to a dumb "cmp --silent". I'm going to make the bold claim that nobody using "dumb http" is going to be impacted by the performance of reading their for-each-ref or for-each-pack dump, hence opting for not even bothering to stat() to get the size before reading.

Because really, if we were *trying* to micro-optimize this for time or memory use there's much better ways, e.g. reading the old file and memcmp() as we go and stream the "generate" callback, but I just don't see the point of trying in this case.

> /*
>  * Create the file "path" by writing to a temporary file and renaming
>  * it into place. The contents of the file come from "generate", which
>  * should return non-zero if it encounters an error.
>  */
> -static int update_info_file(char *path, int (*generate)(FILE *))
> +static int update_info_file(char *path, int (*generate)(FILE *), int force)
> {
> 	char *tmp = mkpathdup("%s_XXXXXX", path);

Unrelated to this patch, but I hadn't thought about this nasty race condition. We recommend users run this from the "post-update" (or "post-receive") hook, and don't juggle the lock along with the ref update, thus due to the vagaries of scheduling you can end up with two concurrent ref updates where the "old" one wins.

But I guess that brings me back to something close to "nobody with that sort of update rate is using 'dumb http'" :)

For this to work properly we'd need to take some sort of global "ref update/pack update" lock, and I guess at that point this "cmp" case would be a helper similar to commit_lock_file_to(), i.e. a commit_lock_file_to_if_different().

> 	int ret = -1;
> 	int fd = -1;
> 	FILE *fp = NULL, *to_close;
> +	int do_update;
>
> 	safe_create_leading_directories(path);
> 	fd = git_mkstemp_mode(tmp, 0666);
> 	if (fd < 0)
> 		goto out;
> -	to_close = fp = fdopen(fd, "w");
> +	to_close = fp = fdopen(fd, "w+");
> 	if (!fp)
> 		goto out;
> 	fd = -1;
> 	ret = generate(fp);
> 	if (ret)
> 		goto out;
> +
> +	do_update = force || files_differ(fp, path);
> [...]
>
> -static int update_info_refs(void)
> +static int update_info_refs(int force)

So, I was going to say "shouldn't we update the docs?" which for --force say "Update the info files from scratch.".

But reading through it that "from scratch" wording is from c743e6e3c0 ("Add a link from update-server-info documentation to repository layout.", 2005-09-04). There it was a reference to a bug since fixed in 60d0526aaa ("Unoptimize info/refs creation.", 2005-09-14), and then removed from the docs in c5fe5b6de9 ("Remove obsolete bug warning in man git-update-server-info", 2009-04-25).

Then in b3223761c8 ("update_info_refs(): drop unused force parameter", 2019-04-05) Jeff removed the unused-for-a-decade "force" param.

So having gone through the history I think we're better off just dropping the --force argument entirely, or at least changing the docs.

Before this change the only thing it was doing was pruning stuff we haven't written since 2005-ish (see 3e15c67c90 ("server-info: throw away T computation as well.", 2005-12-04)); rather than "detect if useless" we should just write out the file again, and then skip if changed (i.e. this logic).

That would also take care of the parse_pack_def() case by proxy, i.e. we don't need some "is stale if pack is missing" special case if we just always write out a new file (or not if it memcmp's the same as the "old" one); we want to change the file if the packs are updated anyway.
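To make that last idea concrete: a hypothetical commit_lock_file_if_different() could wrap the existing lockfile calls like this. The name and the conditional-commit semantics are invented for illustration; only commit_lock_file(), rollback_lock_file(), get_lock_file_path(), and the strbuf helpers are real API:

static int commit_lock_file_if_different(struct lock_file *lk, const char *path)
{
	struct strbuf old = STRBUF_INIT, new = STRBUF_INIT;
	int same = 0;

	/* the staged content sits in the "*.lock" file until committed */
	if (strbuf_read_file(&new, get_lock_file_path(lk), 0) < 0) {
		rollback_lock_file(lk);
		return -1;
	}
	if (strbuf_read_file(&old, path, 0) >= 0)
		same = !strbuf_cmp(&old, &new);
	strbuf_release(&old);
	strbuf_release(&new);
	if (same) {
		rollback_lock_file(lk);	/* keep the old file and its mtime */
		return 0;
	}
	return commit_lock_file(lk);
}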
Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Sat, May 11 2019, Eric Wong wrote:
> > +static int files_differ(FILE *fp, const char *path)
> > +{
> > +	struct stat st;
> > +	git_hash_ctx c;
> > +	struct object_id oid_old, oid_new;
> > +	struct strbuf tmp = STRBUF_INIT;
> > +	long new_len = ftell(fp);
> > +
> > +	if (new_len < 0 || stat(path, &st) < 0)
> > +		return 1;
> > +	if (!S_ISREG(st.st_mode))
> > +		return 1;
> > +	if ((off_t)new_len != st.st_size)
> > +		return 1;
> > +
> > +	rewind(fp);
> > +	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_new.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
> > +		return 1;
> > +	the_hash_algo->init_fn(&c);
> > +	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
> > +	the_hash_algo->final_fn(oid_old.hash, &c);
> > +	strbuf_release(&tmp);
> > +
> > +	return hashcmp(oid_old.hash, oid_new.hash);
> > +}
>
> This way of doing it just seems so weirdly convoluted. Read them one at a time, compute the SHA-1, just to see if they're different. Why not something closer to a plain memcmp():
>
> static int files_differ(FILE *fp, const char *path)
> {
> 	struct strbuf old = STRBUF_INIT, new = STRBUF_INIT;
> 	long new_len = ftell(fp);
> 	int diff = 1;
>
> 	rewind(fp);
> 	if (strbuf_fread(&new, (size_t)new_len, fp) != (size_t)new_len)
> 		goto release_return;
> 	if (strbuf_read_file(&old, path, 0) < 0)
> 		goto release_return;
>
> 	diff = strbuf_cmp(&old, &new);
>
> release_return:
> 	strbuf_release(&old);
> 	strbuf_release(&new);
>
> 	return diff;
> }
>
> I.e. optimize for code simplicity with something close to a dumb "cmp --silent". I'm going to make the bold claim that nobody using "dumb http" is going to be impacted by the performance of reading their for-each-ref or for-each-pack dump, hence opting for not even bothering to stat() to get the size before reading.

I've been trying to improve dumb HTTP for more cases, actually (since it's much cheaper than smart HTTP in server memory/CPU).

> Because really, if we were *trying* to micro-optimize this for time or memory use there's much better ways, e.g. reading the old file and memcmp() as we go and stream the "generate" callback, but I just don't see the point of trying in this case.

I was actually going towards that route, but wasn't sure if this idea would be accepted at all (and I've been trying to stay away from using non-scripting languages). I don't like slurping all of info/refs into memory at all, so maybe a streaming memcmp of the existing file is worth doing...

> > /*
> >  * Create the file "path" by writing to a temporary file and renaming
> >  * it into place. The contents of the file come from "generate", which
> >  * should return non-zero if it encounters an error.
> >  */
> > -static int update_info_file(char *path, int (*generate)(FILE *))
> > +static int update_info_file(char *path, int (*generate)(FILE *), int force)
> > {
> > 	char *tmp = mkpathdup("%s_XXXXXX", path);
>
> Unrelated to this patch, but I hadn't thought about this nasty race condition. We recommend users run this from the "post-update" (or "post-receive") hook, and don't juggle the lock along with the ref update, thus due to the vagaries of scheduling you can end up with two concurrent ref updates where the "old" one wins.
>
> But I guess that brings me back to something close to "nobody with that sort of update rate is using 'dumb http'" :)
>
> For this to work properly we'd need to take some sort of global "ref update/pack update" lock, and I guess at that point this "cmp" case would be a helper similar to commit_lock_file_to(), i.e. a commit_lock_file_to_if_different().

Worth a separate patch, at some point, I think. I'm not too familiar with the existing locking in git, actually... Along those lines, I think repack/gc should automatically update objects/info/packs if the file already exists.

> > 	int ret = -1;
> > 	int fd = -1;
> > 	FILE *fp = NULL, *to_close;
> > +	int do_update;
> >
> > 	safe_create_leading_directories(path);
> > 	fd = git_mkstemp_mode(tmp, 0666);
> > 	if (fd < 0)
> > 		goto out;
> > -	to_close = fp = fdopen(fd, "w");
> > +	to_close = fp = fdopen(fd, "w+");
> > 	if (!fp)
> > 		goto out;
> > 	fd = -1;
> > 	ret = generate(fp);
> > 	if (ret)
> > 		goto out;
> > +
> > +	do_update = force || files_differ(fp, path);
> > [...]
> >
> > -static int update_info_refs(void)
> > +static int update_info_refs(int force)
>
> So, I was going to say "shouldn't we update the docs?" which for --force say "Update the info files from scratch.".
>
> But reading through it that "from scratch" wording is from c743e6e3c0 ("Add a link from update-server-info documentation to repository layout.", 2005-09-04).

Yes, that wording is awkward and I can update it. But maybe making it undocumented is sufficient and would save us the trouble of describing it :)

"--force" might be seen as a performance optimization for cases where you're certain the result will differ, but I'm not sure if that's worth mentioning in the manpage.

> There it was a reference to a bug since fixed in 60d0526aaa ("Unoptimize info/refs creation.", 2005-09-14), and then removed from the docs in c5fe5b6de9 ("Remove obsolete bug warning in man git-update-server-info", 2009-04-25).
>
> Then in b3223761c8 ("update_info_refs(): drop unused force parameter", 2019-04-05) Jeff removed the unused-for-a-decade "force" param.
>
> So having gone through the history I think we're better off just dropping the --force argument entirely, or at least changing the docs.

I can update the docs, or make it undocumented. Compatibility from the command-line needs to remain in case there are scripts using it.
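The streaming comparison Eric alludes to could look roughly like this; an illustrative sketch only, not code from the thread, assuming the same setup where fp is the freshly generated temporary file opened "w+" and path is the file being replaced:

static int files_differ(FILE *fp, const char *path)
{
	char buf_new[4096], buf_old[4096];
	FILE *old = fopen(path, "r");
	int diff = 1;

	if (!old)
		return 1;	/* no old file: it definitely differs */
	rewind(fp);
	for (;;) {
		size_t n_new = fread(buf_new, 1, sizeof(buf_new), fp);
		size_t n_old = fread(buf_old, 1, sizeof(buf_old), old);

		/* for regular files, fread only returns short at EOF/error,
		 * so equal-length files yield equal-sized chunks */
		if (n_new != n_old || memcmp(buf_new, buf_old, n_new))
			break;	/* length or content mismatch */
		if (!n_new) {	/* clean EOF on both streams */
			diff = ferror(fp) || ferror(old);
			break;
		}
	}
	fclose(old);
	return diff;
}

This compares one 4k block at a time (the block size Peff mentions in his reply below) and never holds more than two blocks in memory, at the cost of more code than the strbuf version.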
On Sun, May 12, 2019 at 01:37:55AM +0200, Ævar Arnfjörð Bjarmason wrote:

> This way of doing it just seems so weirdly convoluted. Read them one at a time, compute the SHA-1, just to see if they're different. Why not something closer to a plain memcmp():

FWIW, I had the exact same thought on reading the patch. Checking sizes seems like an easy optimization and I don't mind it, but computing hashes when we could just compare the bytes seems pointless. I'd have expected to stream 4k blocks as we go.

> I.e. optimize for code simplicity with something close to a dumb "cmp --silent". I'm going to make the bold claim that nobody using "dumb http" is going to be impacted by the performance of reading their for-each-ref or for-each-pack dump, hence opting for not even bothering to stat() to get the size before reading.

You're probably right (especially because we'd just spent O(n) work generating the candidate file). But note that I have seen pathological cases where info/refs was gigabytes.

> > /*
> >  * Create the file "path" by writing to a temporary file and renaming
> >  * it into place. The contents of the file come from "generate", which
> >  * should return non-zero if it encounters an error.
> >  */
> > -static int update_info_file(char *path, int (*generate)(FILE *))
> > +static int update_info_file(char *path, int (*generate)(FILE *), int force)
> > {
> > 	char *tmp = mkpathdup("%s_XXXXXX", path);
>
> Unrelated to this patch, but I hadn't thought about this nasty race condition. We recommend users run this from the "post-update" (or "post-receive") hook, and don't juggle the lock along with the ref update, thus due to the vagaries of scheduling you can end up with two concurrent ref updates where the "old" one wins.
>
> But I guess that brings me back to something close to "nobody with that sort of update rate is using 'dumb http'" :)
>
> For this to work properly we'd need to take some sort of global "ref update/pack update" lock, and I guess at that point this "cmp" case would be a helper similar to commit_lock_file_to(), i.e. a commit_lock_file_to_if_different().

I don't think our usual dot-locking is great here. What does the race-loser do when it can't take the lock? It can't just skip its update, since it needs to make sure that its new pack is mentioned. So it has to wait and poll until the other process finishes. I guess maybe that isn't the end of the world.

> Then in b3223761c8 ("update_info_refs(): drop unused force parameter", 2019-04-05) Jeff removed the unused-for-a-decade "force" param.
>
> So having gone through the history I think we're better off just dropping the --force argument entirely, or at least changing the docs.
>
> Before this change the only thing it was doing was pruning stuff we haven't written since 2005-ish (see 3e15c67c90 ("server-info: throw away T computation as well.", 2005-12-04)); rather than "detect if useless" we should just write out the file again, and then skip if changed (i.e. this logic).

Yeah, my commit only ripped out the useless force parameter for info/refs. For info/packs, there's still that weird "is it stale" computation (which I fixed several bugs in). It's not entirely clear to me if that can just go away, but I agree that if we can it's simpler and more desirable to just generate the candidate result and see if it's bit-for-bit identical or not.

I'm not entirely sure of all of the magic that "stale" check is trying to accomplish. I think there's some bits in there that try to preserve the existing ordering, but I don't know why anyone would care (and there are so many cases where the ordering gets thrown out that I think anybody who does care is likely to get disappointed).

-Peff
On Sun, May 12 2019, Jeff King wrote:

> On Sun, May 12, 2019 at 01:37:55AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> This way of doing it just seems so weirdly convoluted. Read them one at a time, compute the SHA-1, just to see if they're different. Why not something closer to a plain memcmp():
>
> FWIW, I had the exact same thought on reading the patch. Checking sizes seems like an easy optimization and I don't mind it, but computing hashes when we could just compare the bytes seems pointless. I'd have expected to stream 4k blocks as we go.
>
>> I.e. optimize for code simplicity with something close to a dumb "cmp --silent". I'm going to make the bold claim that nobody using "dumb http" is going to be impacted by the performance of reading their for-each-ref or for-each-pack dump, hence opting for not even bothering to stat() to get the size before reading.
>
> You're probably right (especially because we'd just spent O(n) work generating the candidate file). But note that I have seen pathological cases where info/refs was gigabytes.

Wouldn't those users be calling update-server-info from something like a post-receive hook where they *know* the refs/packs just got updated?

Well, there is "transfer.unpackLimit" to contend with, but that's just for "packs are same", not "refs are same", and that file's presumably much smaller.

So I'd think that *if* this is a case where there's even a question of the performance mattering here, and I'd assume "refs changed but size is the same" is a common case (e.g. craploads of tags, but just "master" pushed-to), we could make this an opt-in "--update-if-changed", because the "if changed" is extra work, and we're already documenting that you should be calling this on the "I know it changed" hook path.

I don't care even if we use the initial SHA-1 method. It just struck me as odd to conflate the reasonable "don't update mtime for optimizing dumb over-the-wire client" with "optimize resource use when updating".

>> > /*
>> >  * Create the file "path" by writing to a temporary file and renaming
>> >  * it into place. The contents of the file come from "generate", which
>> >  * should return non-zero if it encounters an error.
>> >  */
>> > -static int update_info_file(char *path, int (*generate)(FILE *))
>> > +static int update_info_file(char *path, int (*generate)(FILE *), int force)
>> > {
>> > 	char *tmp = mkpathdup("%s_XXXXXX", path);
>>
>> Unrelated to this patch, but I hadn't thought about this nasty race condition. We recommend users run this from the "post-update" (or "post-receive") hook, and don't juggle the lock along with the ref update, thus due to the vagaries of scheduling you can end up with two concurrent ref updates where the "old" one wins.
>>
>> But I guess that brings me back to something close to "nobody with that sort of update rate is using 'dumb http'" :)
>>
>> For this to work properly we'd need to take some sort of global "ref update/pack update" lock, and I guess at that point this "cmp" case would be a helper similar to commit_lock_file_to(), i.e. a commit_lock_file_to_if_different().
>
> I don't think our usual dot-locking is great here. What does the race-loser do when it can't take the lock? It can't just skip its update, since it needs to make sure that its new pack is mentioned.

Right, I mean a *global* .git/I_AM_DOING_WRITES.lock, because there's no way to square the ref backend's notion of per-ref ref lock enabling concurrent pushes with update-server-info's desire to generate metadata showing *all* the refs.

> So it has to wait and poll until the other process finishes. I guess maybe that isn't the end of the world.

If "its" is update-server-info this won't work. It's possible for two update-server-info processes to be racing in such a way that their for_each_ref() reads a ref at a given value that'll be updated 1 millisecond later, but to then have that update's update-server-info "win" the race to update the info files (hypothetical locks on those info files and all).

Thus the "info" files will be updated to old values, since we'll see we need changes, but change things to the old values.

All things that *can* be dealt with in some clever ways, but I think just further proof nobody's using this for anything serious :)

I.e. the "commit A happened before B" but also "commit B's post-* hooks finished after A" is a thing that happens quite frequently (per my logs).

>> Then in b3223761c8 ("update_info_refs(): drop unused force parameter", 2019-04-05) Jeff removed the unused-for-a-decade "force" param.
>>
>> So having gone through the history I think we're better off just dropping the --force argument entirely, or at least changing the docs.
>>
>> Before this change the only thing it was doing was pruning stuff we haven't written since 2005-ish (see 3e15c67c90 ("server-info: throw away T computation as well.", 2005-12-04)); rather than "detect if useless" we should just write out the file again, and then skip if changed (i.e. this logic).
>
> Yeah, my commit only ripped out the useless force parameter for info/refs. For info/packs, there's still that weird "is it stale" computation (which I fixed several bugs in). It's not entirely clear to me if that can just go away, but I agree that if we can it's simpler and more desirable to just generate the candidate result and see if it's bit-for-bit identical or not.
>
> I'm not entirely sure of all of the magic that "stale" check is trying to accomplish. I think there's some bits in there that try to preserve the existing ordering, but I don't know why anyone would care (and there are so many cases where the ordering gets thrown out that I think anybody who does care is likely to get disappointed).

My reading of it is that it's premature optimization that can go away (and most of it has already), for "it's cheap" and "if not it's called from the 'I know I had an update'" hook case, as noted above.
On Sun, May 12, 2019 at 09:16:31AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > You're probably right (especially because we'd just spent O(n) work generating the candidate file). But note that I have seen pathological cases where info/refs was gigabytes.
>
> Wouldn't those users be calling update-server-info from something like a post-receive hook where they *know* the refs/packs just got updated?
>
> Well, there is "transfer.unpackLimit" to contend with, but that's just for "packs are same", not "refs are same", and that file's presumably much smaller.

Yeah, I think there's sort of an open question here of who is calling update-server-info when nothing got updated. I think the only place we call it automatically is via receive-pack, but I'd guess Eric runs it as part of public-inbox scripts.

I agree that this is probably mostly about info/packs. Not every push (or repo update) will create one, but every push _should_ be changing a ref (if it succeeds at all). And I'd guess Eric's interest comes from the use of Git in public-inbox, which is going to make frequent but small updates.

So this does seem like a really niche case; it avoids one single HTTP request of a file that should generally be small (unless you're letting your pack list grow really big, but I think there are other issues with that) in a case that we know will generate a bunch of other HTTP traffic (if somebody updated the server info, there was indeed a push, so you'd get a refreshed info/refs and presumably the new loose objects).

That said, the logic to preserve the mtime is pretty straightforward, so I don't mind it too much if Eric finds this really improves things for him.

> > I don't think our usual dot-locking is great here. What does the race-loser do when it can't take the lock? It can't just skip its update, since it needs to make sure that its new pack is mentioned.
>
> Right, I mean a *global* .git/I_AM_DOING_WRITES.lock, because there's no way to square the ref backend's notion of per-ref ref lock enabling concurrent pushes with update-server-info's desire to generate metadata showing *all* the refs.

OK. I agree that would work, but it's nasty to delay user-initiated operations for ongoing maintenance (another obvious place to want such a lock is for pruning, which can take quite a while).

> > So it has to wait and poll until the other process finishes. I guess maybe that isn't the end of the world.
>
> If "its" is update-server-info this won't work. It's possible for two update-server-info processes to be racing in such a way that their for_each_ref() reads a ref at a given value that'll be updated 1 millisecond later, but to then have that update's update-server-info "win" the race to update the info files (hypothetical locks on those info files and all).
>
> Thus the "info" files will be updated to old values, since we'll see we need changes, but change things to the old values.
>
> All things that *can* be dealt with in some clever ways, but I think just further proof nobody's using this for anything serious :)
>
> I.e. the "commit A happened before B" but also "commit B's post-* hooks finished after A" is a thing that happens quite frequently (per my logs).

I think it would work because any update-server-info, whether from A or B, will take into account the full current repo state (and we don't look at that state until we take the lock). So you might get an interleaved "A-push, B-push, B-maint, A-maint", but that's OK. A-maint will represent B's state when it runs.

> > I'm not entirely sure of all of the magic that "stale" check is trying to accomplish. I think there's some bits in there that try to preserve the existing ordering, but I don't know why anyone would care (and there are so many cases where the ordering gets thrown out that I think anybody who does care is likely to get disappointed).
>
> My reading of it is that it's premature optimization that can go away (and most of it has already), for "it's cheap" and "if not it's called from the 'I know I had an update'" hook case, as noted above.

That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)

-Peff
On Tue, May 14 2019, Jeff King wrote:

> On Sun, May 12, 2019 at 09:16:31AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > You're probably right (especially because we'd just spent O(n) work generating the candidate file). But note that I have seen pathological cases where info/refs was gigabytes.
>>
>> Wouldn't those users be calling update-server-info from something like a post-receive hook where they *know* the refs/packs just got updated?
>>
>> Well, there is "transfer.unpackLimit" to contend with, but that's just for "packs are same", not "refs are same", and that file's presumably much smaller.
>
> Yeah, I think there's sort of an open question here of who is calling update-server-info when nothing got updated. I think the only place we call it automatically is via receive-pack, but I'd guess Eric runs it as part of public-inbox scripts.
>
> I agree that this is probably mostly about info/packs. Not every push (or repo update) will create one, but every push _should_ be changing a ref (if it succeeds at all). And I'd guess Eric's interest comes from the use of Git in public-inbox, which is going to make frequent but small updates.
>
> So this does seem like a really niche case; it avoids one single HTTP request of a file that should generally be small (unless you're letting your pack list grow really big, but I think there are other issues with that) in a case that we know will generate a bunch of other HTTP traffic (if somebody updated the server info, there was indeed a push, so you'd get a refreshed info/refs and presumably the new loose objects).
>
> That said, the logic to preserve the mtime is pretty straightforward, so I don't mind it too much if Eric finds this really improves things for him.
>
>> > I don't think our usual dot-locking is great here. What does the race-loser do when it can't take the lock? It can't just skip its update, since it needs to make sure that its new pack is mentioned.
>>
>> Right, I mean a *global* .git/I_AM_DOING_WRITES.lock, because there's no way to square the ref backend's notion of per-ref ref lock enabling concurrent pushes with update-server-info's desire to generate metadata showing *all* the refs.
>
> OK. I agree that would work, but it's nasty to delay user-initiated operations for ongoing maintenance (another obvious place to want such a lock is for pruning, which can take quite a while).
>
>> > So it has to wait and poll until the other process finishes. I guess maybe that isn't the end of the world.
>>
>> If "its" is update-server-info this won't work. It's possible for two update-server-info processes to be racing in such a way that their for_each_ref() reads a ref at a given value that'll be updated 1 millisecond later, but to then have that update's update-server-info "win" the race to update the info files (hypothetical locks on those info files and all).
>>
>> Thus the "info" files will be updated to old values, since we'll see we need changes, but change things to the old values.
>>
>> All things that *can* be dealt with in some clever ways, but I think just further proof nobody's using this for anything serious :)
>>
>> I.e. the "commit A happened before B" but also "commit B's post-* hooks finished after A" is a thing that happens quite frequently (per my logs).
>
> I think it would work because any update-server-info, whether from A or B, will take into account the full current repo state (and we don't look at that state until we take the lock). So you might get an interleaved "A-push, B-push, B-maint, A-maint", but that's OK. A-maint will represent B's state when it runs.

Maybe we're talking about different things. I mean the following sequence:

1. Refs "X" and "Y" are at X=A Y=A
2. Concurrent push #1 happens, updating X from A..F
3. Concurrent push #2 happens, updating Y from A..F
4. Concurrent push #1 succeeds
5. Concurrent push #1 starts update-server-info. Reads X=F Y=A
6. Concurrent push #2 succeeds
7. Concurrent push #2 starts update-server-info. Reads X=F Y=F
8. Concurrent push #2's update-server-info finishes, X=F Y=F written to "info"
9. Concurrent push #1's update-server-info finishes, X=A Y=F written to "info"

I.e. because we have per-ref locks and no lock at all on update-server-info (but that would need to be a global ref lock, not just on the "info" files) we can have a push that's already read "X"'s value as "A" while updating "Y" win the race against an update-server-info that updated "X"'s value to "F".

It will get fixed on the next push (at least as far as "X"'s value goes), but until that time dumb clients will falsely see that "X" hasn't been updated.

>> > I'm not entirely sure of all of the magic that "stale" check is trying to accomplish. I think there's some bits in there that try to preserve the existing ordering, but I don't know why anyone would care (and there are so many cases where the ordering gets thrown out that I think anybody who does care is likely to get disappointed).
>>
>> My reading of it is that it's premature optimization that can go away (and most of it has already), for "it's cheap" and "if not it's called from the 'I know I had an update'" hook case, as noted above.
>
> That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)

Yeah I agree with that sentiment. I don't use it. I was just wondering if for those who still want it this wasn't a relatively obscure use-case, so it should perhaps be hidden behind an option if there were optimization concerns.
On Tue, May 14, 2019 at 12:33:11PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > I think it would work because any update-server-info, whether from A or B, will take into account the full current repo state (and we don't look at that state until we take the lock). So you might get an interleaved "A-push, B-push, B-maint, A-maint", but that's OK. A-maint will represent B's state when it runs.
>
> Maybe we're talking about different things. I mean the following sequence:
>
> 1. Refs "X" and "Y" are at X=A Y=A
> 2. Concurrent push #1 happens, updating X from A..F
> 3. Concurrent push #2 happens, updating Y from A..F
> 4. Concurrent push #1 succeeds
> 5. Concurrent push #1 starts update-server-info. Reads X=F Y=A
> 6. Concurrent push #2 succeeds
> 7. Concurrent push #2 starts update-server-info. Reads X=F Y=F
> 8. Concurrent push #2's update-server-info finishes, X=F Y=F written to "info"
> 9. Concurrent push #1's update-server-info finishes, X=A Y=F written to "info"
>
> I.e. because we have per-ref locks and no lock at all on update-server-info (but that would need to be a global ref lock, not just on the "info" files) we can have a push that's already read "X"'s value as "A" while updating "Y" win the race against an update-server-info that updated "X"'s value to "F".
>
> It will get fixed on the next push (at least as far as "X"'s value goes), but until that time dumb clients will falsely see that "X" hasn't been updated.

That's the same situation. But I thought we were talking about having an update-server-info lock. In which case the #2 update-server-info or the #1 update-server-info runs in its entirety, and cannot have their read and write steps interleaved (that's what I meant by "don't look at the state until we take the lock"). Then that gives us a strict ordering: we know that _some_ update-server-info (be it #1 or #2's) will run after any given update.

-Peff
Jeff King <peff@peff.net> wrote:
> Yeah, I think there's sort of an open question here of who is calling update-server-info when nothing got updated. I think the only place we call it automatically is via receive-pack, but I'd guess Eric runs it as part of public-inbox scripts.

Correct. post-update doesn't get run because public-inbox relies on fast-import. I have mirrors using "git fetch", which also doesn't call post-update, so I was calling update-server-info in my mirror script.

Since more people have taken an interest in mirroring things, I figured I'd make "public-inbox-index" (the script which updates the Xapian and SQLite indices) call update-server-info itself.

That way, it's simpler to mirror (v1) inboxes:

	git fetch && git update-server-info && public-inbox-index

becomes:

	git fetch && public-inbox-index

That's a huge savings in cognitive overhead.

So, my eventual goal for this is we get to the point where any git operation which changes refs will automatically update info/refs if it exists. Ditto for objects/info/packs on any pack changes.

This, like my bitmap-by-default change, is among the things I'm doing to make it easier for more users to start self-hosting (including mirroring) any type of git repo.

Anyways, I am far from knowledgeable about the locking discussion for git, though :x

> That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)

I also have something else on my mind for abusing info files with :> (another email)
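For context, the "post-update" hook path discussed throughout the thread is trivial; the sample hook git ships in its templates directory is essentially just:

	#!/bin/sh
	exec git update-server-info

which is why receive-pack is the one place this happens "automatically", and why fetch- and fast-import-based workflows like Eric's have to call it by hand.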
On Tue, May 14 2019, Jeff King wrote:

> On Tue, May 14, 2019 at 12:33:11PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
>> > I think it would work because any update-server-info, whether from A or B, will take into account the full current repo state (and we don't look at that state until we take the lock). So you might get an interleaved "A-push, B-push, B-maint, A-maint", but that's OK. A-maint will represent B's state when it runs.
>>
>> Maybe we're talking about different things. I mean the following sequence:
>>
>> 1. Refs "X" and "Y" are at X=A Y=A
>> 2. Concurrent push #1 happens, updating X from A..F
>> 3. Concurrent push #2 happens, updating Y from A..F
>> 4. Concurrent push #1 succeeds
>> 5. Concurrent push #1 starts update-server-info. Reads X=F Y=A
>> 6. Concurrent push #2 succeeds
>> 7. Concurrent push #2 starts update-server-info. Reads X=F Y=F
>> 8. Concurrent push #2's update-server-info finishes, X=F Y=F written to "info"
>> 9. Concurrent push #1's update-server-info finishes, X=A Y=F written to "info"
>>
>> I.e. because we have per-ref locks and no lock at all on update-server-info (but that would need to be a global ref lock, not just on the "info" files) we can have a push that's already read "X"'s value as "A" while updating "Y" win the race against an update-server-info that updated "X"'s value to "F".
>>
>> It will get fixed on the next push (at least as far as "X"'s value goes), but until that time dumb clients will falsely see that "X" hasn't been updated.
>
> That's the same situation. But I thought we were talking about having an update-server-info lock. In which case the #2 update-server-info or the #1 update-server-info runs in its entirety, and cannot have their read and write steps interleaved (that's what I meant by "don't look at the state until we take the lock"). Then that gives us a strict ordering: we know that _some_ update-server-info (be it #1 or #2's) will run after any given update.

Yeah you're right. I *thought* in my last E-mail we were talking about the current state, but re-reading upthread I see that was a fail on my part. An update-server-info lock would solve this indeed.

We could still end up with a situation where a naïve version of the lockfile API would fail for the "new" update since the old one was underway, so we'd need something similar to core.*Ref*Timeout, but if we ran into a *.lock or the timeout we could exit non-zero, as opposed to silently failing like it does now when it races.
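Concretely, such a lock could reuse git's existing hold_lock_file_for_update_timeout(); everything else in this sketch (the wrapper, the lock path, the 1-second timeout) is invented for illustration:

static int update_server_info_locked(int force)
{
	static struct lock_file lk = LOCK_INIT;
	int ret;

	/* hypothetical global lock: creates .git/info/server-info.lock,
	 * polling up to 1000ms before giving up with a non-zero exit */
	if (hold_lock_file_for_update_timeout(&lk, git_path("info/server-info"),
					      0, 1000) < 0)
		return error_errno("could not lock server info files");
	ret = update_server_info(force);
	rollback_lock_file(&lk);	/* the lock only serializes; nothing to commit */
	return ret;
}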
Eric Wong <e@80x24.org> wrote:
> Jeff King <peff@peff.net> wrote:
> > That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)
>
> I also have something else on my mind for abusing info files with :> (another email)

I'm not sure when/if I'll have time for this; but this ought to be possible:

	GIT_DIR=$HTTP_URL git <any read-only command>

And possible without requiring admins to set up or change anything on their server.

Right now, I could do it by setting up a WebDAV server and using fusedav [1] on the client. But, not everybody runs a WebDAV server which allows PROPFIND for listing files... However, info/refs and objects/info/packs can give us all the info we need without needing PROPFIND. All we'd need is the common GET/HEAD HTTP methods for read-only access.

git doesn't need mmap, and curl + Range requests ought to be able to get us what we need to emulate pread. It'd be great for low-latency LANs; maybe not so great with high latency, but probably better in many cases than cloning a giant repo to cat one blob.

Also, cloning from a static bundle ought to be doable with:

	git clone $REMOTE_OR_LOCAL_PATH/foo.bundle

And yeah, it also sucks that bundles double storage overhead for admins; it would be nice if I could use bundles as alternates or packs...

Anyways, all of this is probably a lot of work and I don't hack much, anymore.

[1] I have many patches for fusedav, and the debian maintainer seems dead, and upstream's moved on: https://bugs.debian.org/fusedav
    davfs2 can't do Range requests, so it won't work for big repos...
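To make the pread emulation concrete: any static file server that honors Range requests can serve byte-level reads of a pack, e.g. (the pack name below is a placeholder):

	# the dumb-HTTP discovery files a client would start from:
	curl -sf "$HTTP_URL/info/refs"
	curl -sf "$HTTP_URL/objects/info/packs"

	# pread(fd, buf, 4096, 0) on a pack index becomes a Range request:
	curl -sf -r 0-4095 "$HTTP_URL/objects/pack/pack-1234abcd.idx" -o idx-head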
On Tue, May 14 2019, Eric Wong wrote:

> Jeff King <peff@peff.net> wrote:
>> Yeah, I think there's sort of an open question here of who is calling update-server-info when nothing got updated. I think the only place we call it automatically is via receive-pack, but I'd guess Eric runs it as part of public-inbox scripts.
>
> Correct. post-update doesn't get run because public-inbox relies on fast-import. I have mirrors using "git fetch", which also doesn't call post-update, so I was calling update-server-info in my mirror script.
>
> Since more people have taken an interest in mirroring things, I figured I'd make "public-inbox-index" (the script which updates the Xapian and SQLite indices) call update-server-info itself.
>
> That way, it's simpler to mirror (v1) inboxes:
>
> 	git fetch && git update-server-info && public-inbox-index
>
> becomes:
>
> 	git fetch && public-inbox-index
>
> That's a huge savings in cognitive overhead.
>
> So, my eventual goal for this is we get to the point where any git operation which changes refs will automatically update info/refs if it exists. Ditto for objects/info/packs on any pack changes.
>
> This, like my bitmap-by-default change, is among the things I'm doing to make it easier for more users to start self-hosting (including mirroring) any type of git repo.
>
> Anyways, I am far from knowledgeable about the locking discussion for git, though :x
>
>> That's my reading, too, but I didn't want to be responsible for regressing some obscure case. At least Eric seems to _use_ update-server-info. ;)
>
> I also have something else on my mind for abusing info files with :> (another email)

Aside from this change, I wonder if making "fetch" optionally "exit 1" if no refs were updated would be useful, as in the below WIP. Of course it would be better to distinguish errors from "no refs to update". I've hacked around it not doing that in the past for similar "fetch and index" things by looking at for-each-ref before/after, or only seeing if HEAD changed (so this wouldn't be a general solution...):

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 4ba63d5ac6..da5414d9db 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -47,6 +47,7 @@ static int prune_tags = -1; /* unspecified */
 #define PRUNE_TAGS_BY_DEFAULT 0 /* do we prune tags by default? */
 
 static int all, append, dry_run, force, keep, multiple, update_head_ok, verbosity, deepen_relative;
+static int exit_code;
 static int progress = -1;
 static int tags = TAGS_DEFAULT, unshallow, update_shallow, deepen;
 static int max_children = 1;
@@ -66,6 +67,7 @@ static struct refspec refmap = REFSPEC_INIT_FETCH;
 static struct list_objects_filter_options filter_options;
 static struct string_list server_options = STRING_LIST_INIT_DUP;
 static struct string_list negotiation_tip = STRING_LIST_INIT_NODUP;
+static int updated_refs;
 
 static int git_fetch_config(const char *k, const char *v, void *cb)
 {
@@ -133,6 +135,8 @@ static struct option builtin_fetch_options[] = {
 	{ OPTION_CALLBACK, 0, "recurse-submodules", &recurse_submodules, N_("on-demand"),
 	  N_("control recursive fetching of submodules"),
 	  PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules },
+	OPT_BOOL(0, "exit-code", &exit_code,
+		 N_("exit successfully if refs are updated")),
 	OPT_BOOL(0, "dry-run", &dry_run,
 		 N_("dry run")),
 	OPT_BOOL('k', "keep", &keep, N_("keep downloaded pack")),
@@ -522,8 +526,10 @@ static int s_update_ref(const char *action,
 	struct strbuf err = STRBUF_INIT;
 	int ret, df_conflict = 0;
 
-	if (dry_run)
+	if (dry_run) {
+		updated_refs++;
 		return 0;
+	}
 	if (!rla)
 		rla = default_rla.buf;
 	msg = xstrfmt("%s: %s", rla, action);
@@ -545,6 +551,7 @@ static int s_update_ref(const char *action,
 	ref_transaction_free(transaction);
 	strbuf_release(&err);
 	free(msg);
+	updated_refs++;
 	return 0;
 fail:
 	ref_transaction_free(transaction);
@@ -1680,5 +1687,9 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	run_command_v_opt(argv_gc_auto.argv, RUN_GIT_CMD);
 	argv_array_clear(&argv_gc_auto);
 
-	return result;
+	if (result)
+		return result;
+	if (exit_code)
+		return !updated_refs;
+	return 0;
 }
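If something along these lines were merged, Eric's mirror pipeline from earlier in the thread could then short-circuit the reindex when nothing changed (the option name is taken from the WIP patch above, so it is hypothetical until the patch lands):

	git fetch --exit-code && public-inbox-index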
On Tue, May 14, 2019 at 12:13:50PM +0000, Eric Wong wrote:

> I'm not sure when/if I'll have time for this; but this ought to be possible:
>
> 	GIT_DIR=$HTTP_URL git <any read-only command>
>
> And possible without requiring admins to set up or change anything on their server.
> [...]
> git doesn't need mmap, and curl + Range requests ought to be able to get us what we need to emulate pread. It'd be great for low-latency LANs; maybe not so great with high latency, but probably better in many cases than cloning a giant repo to cat one blob.

My first thought here is that we could probably just use the filesystem boundary as the appropriate layer, and have an httpfs fuse driver. Your mention of fusedav makes me think you've already gone this route.

I guess non-dav http doesn't necessarily let us enumerate directories. But it might be enough if we could insert ourselves in a few key spots. E.g., when we try to opendir("objects/packs") and it fails with ENOSYS or similar, then we fall back to opening "objects/info/packs" to get the same information (and ditto for loose ref traversal). There'd still be _some_ work in Git, but it wouldn't require replacing every single read with HTTP magic.

> Also, cloning from a static bundle ought to be doable with:
>
> 	git clone $REMOTE_OR_LOCAL_PATH/foo.bundle

Some nearby discussion: https://public-inbox.org/git/20190514092900.GA11679@sigill.intra.peff.net/

> And yeah, it also sucks that bundles double storage overhead for admins; it would be nice if I could use bundles as alternates or packs...

Yes. It's nice that they're a single file for some uses, but in terms of implementation it would be easier if the bundle files just said "here are some refs, and you can find my packfile in the relative filename XYZ.pack". And then you'd just store the two of them together (and a matching .idx if you wanted to use it as a real pack, but the bundle readers wouldn't care).

It probably wouldn't be that hard to wedge that into the bundle format by bumping the version field.

-Peff
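For contrast, here is what already works with bundles today: they are a valid clone source as a local path, just not straight off a dumb HTTP URL (the file and directory names below are arbitrary):

	# server side: pack all refs plus their objects into one flat file
	git bundle create repo.bundle --all
	# client side: a bundle path can be cloned from directly
	git clone repo.bundle my-repo

The storage-doubling complaint follows from this layout: the bundle embeds its own packfile, so a hosting admin ends up keeping the same pack data once inside the bundle and once in objects/pack/.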
On Tue, May 14, 2019 at 02:19:36PM +0200, Ævar Arnfjörð Bjarmason wrote:

> Aside from this change, I wonder if making "fetch" optionally "exit 1" if no refs were updated would be useful, as in the below WIP. Of course it would be better to distinguish errors from "no refs to update".

That seems very sensible to me, as long as it's an optional flag as you have it here.

> @@ -133,6 +135,8 @@ static struct option builtin_fetch_options[] = {
> 	{ OPTION_CALLBACK, 0, "recurse-submodules", &recurse_submodules, N_("on-demand"),
> 	  N_("control recursive fetching of submodules"),
> 	  PARSE_OPT_OPTARG, option_fetch_parse_recurse_submodules },
> +	OPT_BOOL(0, "exit-code", &exit_code,
> +		 N_("exit successfully if refs are updated")),

I think it makes sense to flip this explanation: "exit non-zero if no refs are updated" or "treat it as an error if no refs are updated". Since the default behavior is already to exit successfully in this case. :)

We may also want to advertise a particular result code so callers can distinguish it from regular die() errors.

-Peff
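Applying both of Peff's suggestions to the WIP hunks would amount to something like this (the specific exit code 2 is an invented example; the only constraint is that it differ from die()'s 128 and the generic 1):

	OPT_BOOL(0, "exit-code", &exit_code,
		 N_("exit non-zero if no refs are updated")),

and in cmd_fetch():

	if (exit_code && !updated_refs)
		return 2;	/* advertised "nothing updated" code */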
diff --git a/server-info.c b/server-info.c
index 41274d098b..34599e4817 100644
--- a/server-info.c
+++ b/server-info.c
@@ -6,37 +6,78 @@
 #include "tag.h"
 #include "packfile.h"
 #include "object-store.h"
+#include "strbuf.h"
+
+static int files_differ(FILE *fp, const char *path)
+{
+	struct stat st;
+	git_hash_ctx c;
+	struct object_id oid_old, oid_new;
+	struct strbuf tmp = STRBUF_INIT;
+	long new_len = ftell(fp);
+
+	if (new_len < 0 || stat(path, &st) < 0)
+		return 1;
+	if (!S_ISREG(st.st_mode))
+		return 1;
+	if ((off_t)new_len != st.st_size)
+		return 1;
+
+	rewind(fp);
+	if (strbuf_fread(&tmp, (size_t)new_len, fp) != (size_t)new_len)
+		return 1;
+	the_hash_algo->init_fn(&c);
+	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
+	the_hash_algo->final_fn(oid_new.hash, &c);
+	strbuf_release(&tmp);
+
+	if (strbuf_read_file(&tmp, path, (size_t)st.st_size) < 0)
+		return 1;
+	the_hash_algo->init_fn(&c);
+	the_hash_algo->update_fn(&c, tmp.buf, tmp.len);
+	the_hash_algo->final_fn(oid_old.hash, &c);
+	strbuf_release(&tmp);
+
+	return hashcmp(oid_old.hash, oid_new.hash);
+}
 
 /*
  * Create the file "path" by writing to a temporary file and renaming
  * it into place. The contents of the file come from "generate", which
  * should return non-zero if it encounters an error.
  */
-static int update_info_file(char *path, int (*generate)(FILE *))
+static int update_info_file(char *path, int (*generate)(FILE *), int force)
 {
 	char *tmp = mkpathdup("%s_XXXXXX", path);
 	int ret = -1;
 	int fd = -1;
 	FILE *fp = NULL, *to_close;
+	int do_update;
 
 	safe_create_leading_directories(path);
 	fd = git_mkstemp_mode(tmp, 0666);
 	if (fd < 0)
 		goto out;
-	to_close = fp = fdopen(fd, "w");
+	to_close = fp = fdopen(fd, "w+");
 	if (!fp)
 		goto out;
 	fd = -1;
 	ret = generate(fp);
 	if (ret)
 		goto out;
+
+	do_update = force || files_differ(fp, path);
 	fp = NULL;
 	if (fclose(to_close))
 		goto out;
-	if (adjust_shared_perm(tmp) < 0)
-		goto out;
-	if (rename(tmp, path) < 0)
-		goto out;
+	if (do_update) {
+		if (adjust_shared_perm(tmp) < 0)
+			goto out;
+		if (rename(tmp, path) < 0)
+			goto out;
+	} else {
+		unlink(tmp);
+	}
 	ret = 0;
 
 out:
@@ -78,10 +119,10 @@ static int generate_info_refs(FILE *fp)
 	return for_each_ref(add_info_ref, fp);
 }
 
-static int update_info_refs(void)
+static int update_info_refs(int force)
 {
 	char *path = git_pathdup("info/refs");
-	int ret = update_info_file(path, generate_info_refs);
+	int ret = update_info_file(path, generate_info_refs, force);
 	free(path);
 	return ret;
 }
@@ -254,7 +295,7 @@ static int update_info_packs(int force)
 	int ret;
 
 	init_pack_info(infofile, force);
-	ret = update_info_file(infofile, write_pack_info_file);
+	ret = update_info_file(infofile, write_pack_info_file, force);
 	free_pack_info();
 	free(infofile);
 	return ret;
@@ -269,7 +310,7 @@ int update_server_info(int force)
 	 */
 	int errs = 0;
 
-	errs = errs | update_info_refs();
+	errs = errs | update_info_refs(force);
 	errs = errs | update_info_packs(force);
 
 	/* remove leftover rev-cache file if there is any */
diff --git a/t/t5200-update-server-info.sh b/t/t5200-update-server-info.sh
new file mode 100755
index 0000000000..f1cec46914
--- /dev/null
+++ b/t/t5200-update-server-info.sh
@@ -0,0 +1,41 @@
+#!/bin/sh
+
+test_description='Test git stash show configuration.'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' 'test_commit file'
+
+test_expect_success 'create info/refs' '
+	git update-server-info &&
+	test_path_is_file .git/info/refs
+'
+
+test_expect_success 'modify and store mtime' '
+	test-tool chmtime =0 .git/info/refs &&
+	test-tool chmtime --get .git/info/refs >a
+'
+
+test_expect_success 'info/refs is not needlessly overwritten' '
+	git update-server-info &&
+	test-tool chmtime --get .git/info/refs >b &&
+	test_cmp a b
+'
+
+test_expect_success 'info/refs can be forced to update' '
+	git update-server-info -f &&
+	test-tool chmtime --get .git/info/refs >b &&
+	! test_cmp a b
+'
+
+test_expect_success 'info/refs updates when changes are made' '
+	test-tool chmtime =0 .git/info/refs &&
+	test-tool chmtime --get .git/info/refs >b &&
+	test_cmp a b &&
+	git update-ref refs/heads/foo HEAD &&
+	git update-server-info &&
+	test-tool chmtime --get .git/info/refs >b &&
+	! test_cmp a b
'

+test_done
Do not change the existing info/refs and objects/info/packs files if they match the existing content on the filesystem. This is intended to preserve mtime and make it easier for dumb HTTP pollers to rely on the If-Modified-Since header.

Combined with stdio and kernel buffering, the kernel should be able to avoid block layer writes and reduce wear.

Signed-off-by: Eric Wong <e@80x24.org>
---
 server-info.c                 | 61 +++++++++++++++++++++++++++++------
 t/t5200-update-server-info.sh | 41 +++++++++++++++++++++++
 2 files changed, 92 insertions(+), 10 deletions(-)
 create mode 100755 t/t5200-update-server-info.sh