Message ID | 20190619095858.30124-1-pclouds@gmail.com (mailing list archive) |
---|---|
Series | Add 'ls-files --json' to dump the index in json |
On 6/19/2019 5:58 AM, Nguyễn Thái Ngọc Duy wrote: > This is probably just my itch. Every time I have to do something with > the index, I need to add a little bit code here, a little bit there to > get a better "view" of the index. > > This solves it for me. It allows me to see pretty much everything in the > index (except really low detail stuff like pathname compression). It's > readable by human, but also easy to parse if you need to do statistics > and stuff. You could even do a "diff" between two indexes. > > I'm not really sure if anybody else finds this useful. Because if not, > I guess there's not much point trying to merge it to git.git just for a > single user. Maintaining off tree is still a pain for me, but I think > I can manage it. I think we (Microsoft/VFS for Git engineers) would use this tool, as we frequently need to diagnose something that went wrong in a user's index. Kevin Willford built a tool to search the index and figure out what's going on, but I'm not sure it parses all of the new extensions or was updated to parse the v5 index. Having a translation from the internal index format to an easier-to-parse format is valuable. Thanks, -Stolee
On Wed, Jun 19, 2019 at 6:58 PM Derrick Stolee <stolee@gmail.com> wrote: > > On 6/19/2019 5:58 AM, Nguyễn Thái Ngọc Duy wrote: > > This is probably just my itch. Every time I have to do something with > > the index, I need to add a little bit code here, a little bit there to > > get a better "view" of the index. > > > > This solves it for me. It allows me to see pretty much everything in the > > index (except really low detail stuff like pathname compression). It's > > readable by human, but also easy to parse if you need to do statistics > > and stuff. You could even do a "diff" between two indexes. > > > > I'm not really sure if anybody else finds this useful. Because if not, > > I guess there's not much point trying to merge it to git.git just for a > > single user. Maintaining off tree is still a pain for me, but I think > > I can manage it. > > I think we (Microsoft/VFS for Git engineers) would use this tool, as we > frequently need to diagnose something that went wrong in a user's index. > Kevin Willford built a tool to search the index and figure out what's > going on, but I'm not sure it parses all of the new extensions or was > updated to parse the v5 index. OK I suggest you try it out and see if it really fits your internal tools. I wanted to balance between manual inspection and automation so the output may not be the best for tools. I also try not to freeze the format for more wiggle room, which would be fine for one-time scripts, but if you want to have real tools depend on it, we may have to look harder at the output format and make sure it's good enough for some time, and have some documentation. Also, I don't suppose it matters, but just for the record I don't care at all about --json performance. I suppose Jeff's json writer does not cache the entire json output in memory, so dumping giant index files is fine. But some other things, like reading the index with multiple threads, are also disabled.
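As a rough illustration of the "statistics" and "diff between two indexes" uses mentioned in the cover letter, a one-time script over two dumps could look like the sketch below. The key names ("entries", "name", "mode", "id") and the sample data are assumptions made up for this example; the thread does not show the actual output format, which is explicitly not frozen.

```python
import json
from collections import Counter

# Hypothetical dumps: the "entries", "name", "mode" and "id" keys are
# assumptions for illustration, not the format the patch series emits.
OLD_DUMP = json.dumps({"entries": [
    {"name": "a.c", "mode": "100644", "id": "1111"},
    {"name": "b.c", "mode": "100644", "id": "2222"},
]})
NEW_DUMP = json.dumps({"entries": [
    {"name": "a.c", "mode": "100644", "id": "1111"},
    {"name": "b.c", "mode": "100644", "id": "3333"},
    {"name": "run.sh", "mode": "100755", "id": "4444"},
]})


def by_name(dump):
    """Map each entry's path to the full entry object."""
    return {entry["name"]: entry for entry in json.loads(dump)["entries"]}


def diff_indexes(old, new):
    """Report paths that were added, removed, or changed between two dumps."""
    old_e, new_e = by_name(old), by_name(new)
    common = old_e.keys() & new_e.keys()
    return {
        "added": sorted(new_e.keys() - old_e.keys()),
        "removed": sorted(old_e.keys() - new_e.keys()),
        "changed": sorted(n for n in common if old_e[n] != new_e[n]),
    }


def mode_stats(dump):
    """Count entries per file mode."""
    return Counter(entry["mode"] for entry in json.loads(dump)["entries"])


print(diff_indexes(OLD_DUMP, NEW_DUMP))
print(mode_stats(NEW_DUMP))
```

Because the script only touches the keys it names, it stays short enough to rewrite if the output format changes, which fits the "fine for one-time scripts" caveat above.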
On 6/19/2019 8:42 AM, Duy Nguyen wrote: > On Wed, Jun 19, 2019 at 6:58 PM Derrick Stolee <stolee@gmail.com> wrote: >> >> On 6/19/2019 5:58 AM, Nguyễn Thái Ngọc Duy wrote: >>> This is probably just my itch. Every time I have to do something with >>> the index, I need to add a little bit code here, a little bit there to >>> get a better "view" of the index. >>> >>> This solves it for me. It allows me to see pretty much everything in the >>> index (except really low detail stuff like pathname compression). It's >>> readable by human, but also easy to parse if you need to do statistics >>> and stuff. You could even do a "diff" between two indexes. >>> >>> I'm not really sure if anybody else finds this useful. Because if not, >>> I guess there's not much point trying to merge it to git.git just for a >>> single user. Maintaining off tree is still a pain for me, but I think >>> I can manage it. >> >> I think we (Microsoft/VFS for Git engineers) would use this tool, as we >> frequently need to diagnose something that went wrong in a user's index. >> Kevin Willford built a tool to search the index and figure out what's >> going on, but I'm not sure it parses all of the new extensions or was >> updated to parse the v5 index. > > OK I suggest you try it out and see if it really fits your internal > tools. I wanted to balance between manual inspection and automation so > the output may not be the best for tools. I also try not to freeze the > format for more wiggle room, which would be fine for one-time scripts, > but if you want to have real tools depend on it, we may have to look > harder at the output format and make sure it's good enough for some > time, and have some documentation. > > Also, I don't suppose it matters, but just for the record I don't care > at all about --json performance. I suppose Jeff's json writer does not > cache the entire json output in memory, so dumping giant index files > is fine. 
> But some other things, like reading the index with multiple > threads, are also disabled. Performance is not critical here, and in fact it would certainly become slower because of the extra parsing details. However, I think using JSON as a translation layer will make any tools that consume the JSON more resilient to future index format updates. That stability is valuable. Even though the JSON format is not guaranteed to stay the same, it is easier to update an object model to the JSON format than a new binary parser. Thanks, -Stolee
On Wed, Jun 19, 2019 at 04:58:50PM +0700, Nguyễn Thái Ngọc Duy wrote: > This is probably just my itch. Every time I have to do something with > the index, I need to add a little bit code here, a little bit there to > get a better "view" of the index. > > This solves it for me. It allows me to see pretty much everything in the > index (except really low detail stuff like pathname compression). It's > readable by human, but also easy to parse if you need to do statistics > and stuff. You could even do a "diff" between two indexes. > > I'm not really sure if anybody else finds this useful. Because if not, > I guess there's not much point trying to merge it to git.git just for a > single user. Maintaining off tree is still a pain for me, but I think > I can manage it. I don't have any particular use for this, but I am all in favor of tools that make it easier to access and analyze information kept in our on-disk formats (some of this is available via --debug, I think, but AFAIK most of the extension bits are not). And I'd rather see something like JSON than inventing yet another ad-hoc output format. I think your warning in the manpage that this is for debugging is fine, as it does not put us on the hook for maintaining the feature nor its format forever. We might want to call it "--debug=json" or something, though, in case we do want real stable json support later (though of course we would be free to steal the option then, since we're making no promises). -Peff
Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes: > This is probably just my itch. Every time I have to do something with > the index, I need to add a little bit code here, a little bit there to > get a better "view" of the index. ;-) JSON is not particularly my cup-of-tea but it is better than many other things exactly for one reason (everybody and their dog have heard of it), and certainly is far superior to inventing our own ad-hoc format. Thanks for working on this (I do not expect I would see an immediate need for this myself, though).
On 6/19/2019 5:58 AM, Nguyễn Thái Ngọc Duy wrote: > This is probably just my itch. Every time I have to do something with > the index, I need to add a little bit code here, a little bit there to > get a better "view" of the index. > > This solves it for me. It allows me to see pretty much everything in the > index (except really low detail stuff like pathname compression). It's > readable by human, but also easy to parse if you need to do statistics > and stuff. You could even do a "diff" between two indexes. > > I'm not really sure if anybody else finds this useful. Because if not, > I guess there's not much point trying to merge it to git.git just for a > single user. Maintaining off tree is still a pain for me, but I think > I can manage it. > > Nguyễn Thái Ngọc Duy (8): > ls-files: add --json to dump the index > split-index.c: dump "link" extension as json > fsmonitor.c: dump "FSMN" extension as json > resolve-undo.c: dump "REUC" extension as json > read-cache.c: dump "EOIE" extension as json > read-cache.c: dump "IEOT" extension as json > cache-tree.c: dump "TREE" extension as json > dir.c: dump "UNTR" extension as json > > Documentation/git-ls-files.txt | 5 ++ > builtin/ls-files.c | 30 +++++-- > cache-tree.c | 41 ++++++++-- > cache-tree.h | 5 +- > cache.h | 2 + > dir.c | 56 ++++++++++++- > dir.h | 4 +- > fsmonitor.c | 9 +++ > json-writer.c | 30 +++++++ > json-writer.h | 29 +++++++ > read-cache.c | 139 ++++++++++++++++++++++++++++++--- > resolve-undo.c | 36 ++++++++- > resolve-undo.h | 4 +- > split-index.c | 13 ++- > 14 files changed, 376 insertions(+), 27 deletions(-) > Thanks for working on this! I've been wanting to do something like this for a while. I too am tired of digging thru hex dumps or "od" output whenever I have an odd problem to investigate. This will certainly help. Jeff
On Thu, Jun 20, 2019 at 2:17 AM Jeff King <peff@peff.net> wrote: > I think your warning in the manpage that this is for debugging is fine, > as it does not put us on the hook for maintaining the feature nor its > format forever. We might want to call it "--debug=json" or something, Hmm.. does it mean we make --debug PARSE_OPT_OPTARG? In other words, "--debug" still means "text", --debug=json is obvious, but "--debug json" means "text" debug with pathspec "json". Which is really horrible in my opinion. Or is it ok to just make the argument mandatory? That would be a behavior change, but I suppose --debug is a thing only we use and could still be a safe thing to do... > though, in case we do want real stable json support later (though of > course we would be free to steal the option then, since we're making no > promises). > > -Peff
Hi Peff, On Wed, 19 Jun 2019, Jeff King wrote: > On Wed, Jun 19, 2019 at 04:58:50PM +0700, Nguyễn Thái Ngọc Duy wrote: > > > This is probably just my itch. Every time I have to do something with > > the index, I need to add a little bit code here, a little bit there to > > get a better "view" of the index. > > > > This solves it for me. It allows me to see pretty much everything in the > > index (except really low detail stuff like pathname compression). It's > > readable by human, but also easy to parse if you need to do statistics > > and stuff. You could even do a "diff" between two indexes. > > > > I'm not really sure if anybody else finds this useful. Because if not, > > I guess there's not much point trying to merge it to git.git just for a > > single user. Maintaining off tree is still a pain for me, but I think > > I can manage it. > > I don't have any particular use for this, but I am all in favor of tools > that make it easier to access and analyze information kept in our > on-disk formats (some of this is available via --debug, I think, but > AFAIK most of the extension bits are not). > > And I'd rather see something like JSON than inventing yet another ad-hoc > output format. > > I think your warning in the manpage that this is for debugging is fine, > as it does not put us on the hook for maintaining the feature nor its > format forever. We might want to call it "--debug=json" or something, > though, in case we do want real stable json support later (though of > course we would be free to steal the option then, since we're making no > promises). Traditionally, we have not catered well to 3rd-party applications in Git, and this JSON format would provide a way out of that problem. So I would like *not* to lock the door on letting this feature stabilize organically. I'd be much more in favor of `--json[=<version>]`, with an initial version of 0 to indicate that it really is unstable for now. Ciao, Dscho
On Fri, Jun 21, 2019 at 8:16 PM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > I think your warning in the manpage that this is for debugging is fine, > > as it does not put us on the hook for maintaining the feature nor its > > format forever. We might want to call it "--debug=json" or something, > > though, in case we do want real stable json support later (though of > > course we would be free to steal the option then, since we're making no > > promises). > > Traditionally, we have not catered well to 3rd-party applications in Git, > and this JSON format would provide a way out of that problem. > > So I would like *not* to lock the door on letting this feature stabilize > organically. > > I'd be much more in favor of `--json[=<version>]`, with an initial version > of 0 to indicate that it really is unstable for now. Considering the amount of code to output these, supporting multiple formats would be a nightmare. I may be ok with versioning the output so the tools know what format they need to deal with, but I'd rather support just one version. For third parties wanting to dig deep, I think libgit2 would be a much better fit.
Duy Nguyen <pclouds@gmail.com> writes: > Considering the amount of code to output these, supporting multiple > formats would be a nightmare. I may be ok with versioning the output > so the tools know what format they need to deal with, but I'd rather > support just one version. For third parties wanting to dig deep, I > think libgit2 would be a much better fit. Yeah, I think starting with --debug=json (or --debug-json) until we see some stability in the output and get comfortable with the idea of "version X" meaning what we output at that point, and then renaming it to "--json" with "version: 1" in the output stream so that third parties can use it (and interpret it according to version 1 rules) is the way to go. Third-party tools are welcome to read --debug-json output as an early-adoption practice waiting for the real thing, but we do not want to be locked into a schema too early before we are ready. Thanks.
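The "version: 1 in the output stream" idea implies that a consumer checks the version before interpreting anything else. A minimal sketch of that check follows, assuming a top-level "version" key holding an integer; both the key's placement and the `UnsupportedDump` and `load_dump` names are hypothetical, since nothing in the thread fixes them.

```python
import json

# Versions this (hypothetical) tool knows how to interpret.
SUPPORTED_VERSIONS = {1}


class UnsupportedDump(Exception):
    """Raised when a dump declares no version or an unknown one."""


def load_dump(text):
    """Parse a dump and refuse it unless its declared version is supported.

    Assumes a top-level "version" key; a dump without one is treated as
    unstable debug output and rejected as well.
    """
    dump = json.loads(text)
    version = dump.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedDump(f"unsupported dump version: {version!r}")
    return dump


print(load_dump('{"version": 1, "entries": []}'))
```

Rejecting unversioned output keeps a tool from silently depending on the debugging format while it is still allowed to change.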
On Fri, Jun 21, 2019 at 03:37:45PM +0700, Duy Nguyen wrote: > On Thu, Jun 20, 2019 at 2:17 AM Jeff King <peff@peff.net> wrote: > > I think your warning in the manpage that this is for debugging is fine, > > as it does not put us on the hook for maintaining the feature nor its > > format forever. We might want to call it "--debug=json" or something, > > Hmm.. does it mean we make --debug PARSE_OPT_OPTARG? In other words, > "--debug" still means "text", --debug=json is obvious, but "--debug > json" means "text" debug with pathspec "json". Which is really > horrible in my opinion. Yeah, that's the nature of OPTARG. ;) > Or is it ok to just make the argument mandatory? That would be a > behavior change, but I suppose --debug is a thing only we use and > could still be a safe thing to do... Yeah, I think that would be perfectly fine (or you could just call it --debug-json as a new option, if you didn't want to make people do --debug=text for the existing behavior). -Peff
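The three cases Duy lists for an optional-argument `--debug` can be modelled with a toy parser. This is only a model of the behaviour described in the thread, written in Python for brevity; it is not Git's parse-options code, and `parse_debug` is a made-up name.

```python
def parse_debug(argv):
    """Return (debug_format, pathspecs) for a toy command line.

    Models an option whose argument is optional: the value is taken only
    from the "--debug=<value>" form, never from the following word.
    """
    debug_format = None
    pathspecs = []
    for arg in argv:
        if arg == "--debug":
            # Bare option: fall back to the default format.
            debug_format = "text"
        elif arg.startswith("--debug="):
            debug_format = arg[len("--debug="):]
        else:
            # Anything else, including a word right after "--debug",
            # is treated as a pathspec.
            pathspecs.append(arg)
    return debug_format, pathspecs


for argv in (["--debug"], ["--debug=json"], ["--debug", "json"]):
    print(argv, "->", parse_debug(argv))
```

The last case is the surprising one: the word "json" ends up as a pathspec, which is why a mandatory argument or a separate `--debug-json` option is suggested instead.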
On Fri, Jun 21, 2019 at 03:16:52PM +0200, Johannes Schindelin wrote: > > I think your warning in the manpage that this is for debugging is fine, > > as it does not put us on the hook for maintaining the feature nor its > > format forever. We might want to call it "--debug=json" or something, > > though, in case we do want real stable json support later (though of > > course we would be free to steal the option then, since we're making no > > promises). > > Traditionally, we have not catered well to 3rd-party applications in Git, > and this JSON format would provide a way out of that problem. > > So I would like *not* to lock the door on letting this feature stabilize > organically. I'd like it to stabilize organically, too, but my thinking was that we'd wait a while and then promote it to a stable name eventually. > I'd be much more in favor of `--json[=<version>]`, with an initial version > of 0 to indicate that it really is unstable for now. That's OK with me, too, if you think "0" indicates that sufficiently (we've used "v0" in a lot of other places to refer to stable protocols, like the git:// one). Maybe it's OK with some documentation making it clear. I'm not sure whether we want to be locked into supporting this v0 forever or not (though maybe it would not be such a burden). I think JSON-based output also has the potential to need fewer bumps. It's syntactically stable, so it's really just about our schema. And it's easy to say "newer versions of Git may produce new keys; you can ignore them", as long as we do not change the meaning of existing keys. That might be an easier promise to make. -Peff
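The "new keys may appear; you can ignore them" promise is easy for a consumer to honour if it reads only the keys it knows. A small sketch, with the "entries", "name" and "mode" keys (and the extra "flags" and "extensions" keys in the newer sample) all invented for illustration:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """The subset of an index entry this hypothetical tool cares about."""
    name: str
    mode: str


def read_entries(text):
    """Build Entry objects from a dump, ignoring any keys not listed here."""
    dump = json.loads(text)
    return [Entry(name=e["name"], mode=e["mode"])
            for e in dump.get("entries", [])]


# The same data as an older and a newer producer might emit it (both invented).
OLDER = '{"entries": [{"name": "a.c", "mode": "100644"}]}'
NEWER = ('{"entries": [{"name": "a.c", "mode": "100644", "flags": 0}],'
         ' "extensions": {}}')

print(read_entries(OLDER) == read_entries(NEWER))
```

This only works as long as existing keys keep their meaning, which is exactly the condition stated above.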
On Fri, Jun 21, 2019 at 08:10:58AM -0700, Junio C Hamano wrote: > Duy Nguyen <pclouds@gmail.com> writes: > > > Considering the amount of code to output these, supporting multiple > > formats would be a nightmare. I may be ok with versioning the output > > so the tools know what format they need to deal with, but I'd rather > > support just one version. For third parties wanting to dig deep, I > > think libgit2 would be a much better fit. > > Yeah, I think starting with --debug=json (or --debug-json) until we > see some stability in the output and get comfortable with the idea of > "version X" meaning what we output at that point, and then renaming > it to "--json" with "version: 1" in the output stream so that third > parties can use it (and interpret it according to version 1 rules) is > the way to go. Third-party tools are welcome to read --debug-json > output as an early-adoption practice waiting for the real thing, but > we do not want to be locked into a schema too early before we are > ready. I should have read the whole thread before responding. I made a similar comment to Dscho, so I guess that is now two of us. :) -Peff
On 2019-06-19 at 09:58:50, Nguyễn Thái Ngọc Duy wrote: > This is probably just my itch. Every time I have to do something with > the index, I need to add a little bit code here, a little bit there to > get a better "view" of the index. > > This solves it for me. It allows me to see pretty much everything in the > index (except really low detail stuff like pathname compression). It's > readable by human, but also easy to parse if you need to do statistics > and stuff. You could even do a "diff" between two indexes. > > I'm not really sure if anybody else finds this useful. Because if not, > I guess there's not much point trying to merge it to git.git just for a > single user. Maintaining off tree is still a pain for me, but I think > I can manage it. I'm generally in favor of this, but we need to document what this does when it encounters paths that are not valid UTF-8. (Ideally, the answer is, "die()", but I suspect the answer will be "silently produce invalid output".) Those can of course occur on Unix systems, but also on Windows, where unpaired surrogates can occur.
On Sat, Jun 22, 2019 at 6:31 AM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > On 2019-06-19 at 09:58:50, Nguyễn Thái Ngọc Duy wrote: > > This is probably just my itch. Every time I have to do something with > > the index, I need to add a little bit code here, a little bit there to > > get a better "view" of the index. > > > > This solves it for me. It allows me to see pretty much everything in the > > index (except really low detail stuff like pathname compression). It's > > readable by human, but also easy to parse if you need to do statistics > > and stuff. You could even do a "diff" between two indexes. > > > > I'm not really sure if anybody else finds this useful. Because if not, > > I guess there's not much point trying to merge it to git.git just for a > > single user. Maintaining off tree is still a pain for me, but I think > > I can manage it. > > I'm generally in favor of this, but we need to document what this does > when it encounters paths that are not valid UTF-8. (Ideally, the answer > is, "die()", but I suspect the answer will be "silently produce invalid > output".) I think you're right, we don't assume anything when writing json strings, so it's not going to be utf-8 (or die) if the path is also not valid utf-8. The good thing is all this could be done in just one place, append_quoted_string(), if someone needs to. I'll just go document the fact that we may produce invalid UTF-8.
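If the dump may contain path bytes that are not valid UTF-8, a consumer has to decide how to decode the stream before handing it to a JSON parser. One possible approach in Python is the `surrogateescape` error handler, which keeps undecodable bytes recoverable. The sample bytes and the "name" key below are invented for illustration, and `load_lossless` and `path_bytes` are hypothetical helper names.

```python
import json

# A made-up dump whose path contains the byte 0xE9 (Latin-1 "e acute"),
# which is not valid UTF-8 on its own.
SAMPLE = b'{"name": "caf\xe9.txt"}'


def load_lossless(raw):
    """Decode bytes so invalid UTF-8 survives as lone surrogates, then parse."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return json.loads(text)


def path_bytes(name):
    """Turn a decoded path back into the exact bytes found in the dump."""
    return name.encode("utf-8", errors="surrogateescape")


try:
    json.loads(SAMPLE)  # strict decoding rejects the stray byte
except ValueError as err:
    print("strict parse failed:", type(err).__name__)

print(path_bytes(load_lossless(SAMPLE)["name"]))
```

Strict parsers in other languages may simply refuse such input, so documenting the behaviour, as proposed above, matters for anyone consuming the output.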
Hi Duy, On Fri, 21 Jun 2019, Duy Nguyen wrote: > On Fri, Jun 21, 2019 at 8:16 PM Johannes Schindelin > <Johannes.Schindelin@gmx.de> wrote: > > > > I think your warning in the manpage that this is for debugging is fine, > > > as it does not put us on the hook for maintaining the feature nor its > > > format forever. We might want to call it "--debug=json" or something, > > > though, in case we do want real stable json support later (though of > > > course we would be free to steal the option then, since we're making no > > > promises). > > > > Traditionally, we have not catered well to 3rd-party applications in Git, > > and this JSON format would provide a way out of that problem. > > > > So I would like *not* to lock the door on letting this feature stabilize > > organically. > > > > I'd be much more in favor of `--json[=<version>]`, with an initial version > > of 0 to indicate that it really is unstable for now. > > Considering the amount of code to output these, supporting multiple > formats would be a nightmare. I may be ok with versioning the output > so the tools know what format they need to deal with, but I'd rather > support just one version. Once the format has stabilized, I don't think it would be a huge burden to support multiple formats, if we ever had to update. It would, however, be a huge burden on third-party applications. In effect, we could be lazy, but we would put a lot more burden on others than we saved ourselves, so that would be a bit... selfish. > For third parties wanting to dig deep, I think libgit2 would be a much > better fit. If we (i.e. the core Git contributors) were contributing new features/bug fixes to libgit2, that would be a good recommendation. But we don't. We essentially ignore libgit2 (and all of their learnings) all the time. Even worse, for years, even decades, we recommended the command-line as "the API". 
If you want to reverse that recommendation, I think it merits a bigger discussion than an offhand comment buried in a thread about an experimental feature. Ciao, Dscho
Hi Peff & Junio, On Fri, 21 Jun 2019, Jeff King wrote: > On Fri, Jun 21, 2019 at 08:10:58AM -0700, Junio C Hamano wrote: > > > Duy Nguyen <pclouds@gmail.com> writes: > > > > > Considering the amount of code to output these, supporting multiple > > > formats would be a nightmare. I may be ok with versioning the output > > > so the tools know what format they need to deal with, but I'd rather > > > support just one version. For third parties wanting to dig deep, I > > > think libgit2 would be a much better fit. > > > > Yeah, I think starting with --debug=json (or --debug-json) until we > > see some stability in the output and get comfortable with the idea of > > "version X" meaning what we output at that point, and then renaming > > it to "--json" with "version: 1" in the output stream so that third > > parties can use it (and interpret it according to version 1 rules) is > > the way to go. Third-party tools are welcome to read --debug-json > > output as an early-adoption practice waiting for the real thing, but > > we do not want to be locked into a schema too early before we are > > ready. > > I should have read the whole thread before responding. I made a similar > comment to Dscho, so I guess that is now two of us. :) It is a bit of a chicken-and-egg problem. You want the format to stabilize. But you also don't want to commit to one final format. And you choose as option name a deliberately discouraging one, deterring the (third-party application) developers who could most help you evolve the format to a sensible and useful stable version. Ciao, Dscho
On Mon, Jun 24, 2019 at 4:32 PM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi Duy, > > On Fri, 21 Jun 2019, Duy Nguyen wrote: > > > On Fri, Jun 21, 2019 at 8:16 PM Johannes Schindelin > > <Johannes.Schindelin@gmx.de> wrote: > > > > > > I think your warning in the manpage that this is for debugging is fine, > > > > as it does not put us on the hook for maintaining the feature nor its > > > > format forever. We might want to call it "--debug=json" or something, > > > > though, in case we do want real stable json support later (though of > > > > course we would be free to steal the option then, since we're making no > > > > promises). > > > > > > Traditionally, we have not catered well to 3rd-party applications in Git, > > > and this JSON format would provide a way out of that problem. > > > > > > So I would like *not* to lock the door on letting this feature stabilize > > > organically. > > > > > > I'd be much more in favor of `--json[=<version>]`, with an initial version > > > of 0 to indicate that it really is unstable for now. > > > > Considering the amount of code to output these, supporting multiple > > formats would be a nightmare. I may be ok with versioning the output > > so the tools know what format they need to deal with, but I'd rather > > support just one version. > > Once the format has stabilized, I don't think it would be a huge burden to > support multiple formats, if we ever had to update. > > It would, however, be a huge burden on third-party applications. In > effect, we could be lazy, but we would put a lot more burden on others > than we saved ourselves, so that would be a bit... selfish. JSON is the land of high level languages. They can adapt to a new format quite easily, compared to restructuring C to support multiple different formats. Yes, I'm quite OK with being selfish in this case.
Hi Peff, On Fri, 21 Jun 2019, Jeff King wrote: > On Fri, Jun 21, 2019 at 03:16:52PM +0200, Johannes Schindelin wrote: > > > > I think your warning in the manpage that this is for debugging is fine, > > > as it does not put us on the hook for maintaining the feature nor its > > > format forever. We might want to call it "--debug=json" or something, > > > though, in case we do want real stable json support later (though of > > > course we would be free to steal the option then, since we're making no > > > promises). > > > > Traditionally, we have not catered well to 3rd-party applications in Git, > > and this JSON format would provide a way out of that problem. > > > > So I would like *not* to lock the door on letting this feature stabilize > > organically. > > I'd like it to stabilize organically, too, but my thinking was that we'd > wait a while and then promote it to a stable name eventually. Git's command-line options have stabilized organically. Example: to include untracked files in `git stash`, use `-u` or `--include-untracked`, to include them in `git add`, use `-A` or `--all`, to include them in `git grep`, use `--untracked` (no short option), to include them in `git ls-files`, use `-o` or `--others`. The command `git commit` does not even have an option to include untracked files. You know of more examples of organically grown designs in Git, I am sure. Given those examples, I am not sure that I want the JSON format to stabilize organically. > > I'd be much more in favor of `--json[=<version>]`, with an initial > > version of 0 to indicate that it really is unstable for now. > > That's OK with me, too, if you think "0" indicates that sufficiently > (we've used "v0" in a lot of other places to refer to stable protocols, > like the git:// one). Maybe it's OK with some documentation making it > clear. I did think that the `0` would be clear, but you are probably right. 
> I'm not sure whether we want to be locked into supporting this v0 > forever or not (though maybe it would not be such a burden). > > I think JSON-based output also has the potential to need fewer bumps. > It's syntactically stable, so it's really just about our schema. And > it's easy to say "newer versions of Git may produce new keys; you can > ignore them", as long as we do not change the meaning of existing keys. > That might be an easier promise to make. Right. Thanks, Dscho