Message ID | 20221117113023.65865-1-tenglong.tl@alibaba-inc.com (mailing list archive) |
---|---|
Headers | show |
Series | ls-tree: introduce '--pattern' option | expand |
On Thu, Nov 17 2022, Teng Long wrote: > From: Teng Long <dyroneteng@gmail.com> > > This RFC patch introduce a new "ls-tree" option "--pattern", aim to match the > entries by regex then filter the output which we may want to achieve. It also > contains some commit for preparation or cleanup. > > The idea may be not comprehensive and the tests for it might be insufficient > too, but I'd like to listen the suggestion from the community to decide if it's > worth going forward with. I applied this series, compiled with CFLAGS=-O3, and: $ hyperfine './git ls-tree --pattern=[ab]c.*d -r HEAD' './git ls-tree -r HEAD | grep [ab]c.*d' -w 10 -r 20 Benchmark 1: ./git ls-tree --pattern=[ab]c.*d -r HEAD Time (mean ± σ): 14.8 ms ± 0.1 ms [User: 12.0 ms, System: 2.8 ms] Range (min … max): 14.6 ms … 15.0 ms 20 runs Benchmark 2: ./git ls-tree -r HEAD | grep [ab]c.*d Time (mean ± σ): 12.5 ms ± 0.1 ms [User: 10.0 ms, System: 4.0 ms] Range (min … max): 12.4 ms … 12.8 ms 20 runs Summary './git ls-tree -r HEAD | grep [ab]c.*d' ran 1.18 ± 0.01 times faster than './git ls-tree --pattern=[ab]c.*d -r HEAD' So the value-proposition isn't really clear to me, and the included docs, commit messages & this CL don't answer the "why not just 'grep'" question? That's faster even with another process for me, but likely that's because you're doing the regex matching really inefficiently (e.g. malloc-ing again for each line), which could be "fixed". But in any setup which cares about the performance you're likely piping to another process anyway (the thing using the data), which could do that filtering without thep "grep" process. So I don't see the value in doing this, but maybe I'm just missing something. And, in terms of the complexity for git's implementation it would be really good to avoid the complexity of a "--pattern", "--sort-lines" etc., if those use-cases can be satisfied by piping into "grep" or "sort" instead. Some of the pre-cleanup here looks good, but it's unrelated to the rest of the series. I think in any case that it would be nice to see that as another topic.
On Thu, Nov 17, 2022 at 02:22:20PM +0100, Ævar Arnfjörð Bjarmason wrote: > > On Thu, Nov 17 2022, Teng Long wrote: > > > From: Teng Long <dyroneteng@gmail.com> > > > > This RFC patch introduce a new "ls-tree" option "--pattern", aim to match the > > entries by regex then filter the output which we may want to achieve. It also > > contains some commit for preparation or cleanup. > > > > The idea may be not comprehensive and the tests for it might be insufficient > > too, but I'd like to listen the suggestion from the community to decide if it's > > worth going forward with. > > I applied this series, compiled with CFLAGS=-O3, and: > > $ hyperfine './git ls-tree --pattern=[ab]c.*d -r HEAD' './git ls-tree -r HEAD | grep [ab]c.*d' -w 10 -r 20 > Benchmark 1: ./git ls-tree --pattern=[ab]c.*d -r HEAD > Time (mean ± σ): 14.8 ms ± 0.1 ms [User: 12.0 ms, System: 2.8 ms] > Range (min … max): 14.6 ms … 15.0 ms 20 runs > > Benchmark 2: ./git ls-tree -r HEAD | grep [ab]c.*d > Time (mean ± σ): 12.5 ms ± 0.1 ms [User: 10.0 ms, System: 4.0 ms] > Range (min … max): 12.4 ms … 12.8 ms 20 runs > > Summary > './git ls-tree -r HEAD | grep [ab]c.*d' ran > 1.18 ± 0.01 times faster than './git ls-tree --pattern=[ab]c.*d -r HEAD' > > So the value-proposition isn't really clear to me, and the included > docs, commit messages & this CL don't answer the "why not just 'grep'" > question? > > That's faster even with another process for me, but likely that's > because you're doing the regex matching really inefficiently > (e.g. malloc-ing again for each line), which could be "fixed". > > But in any setup which cares about the performance you're likely piping > to another process anyway (the thing using the data), which could do > that filtering without thep "grep" process. > > So I don't see the value in doing this, but maybe I'm just missing > something. I think this falls into the same trap as the series on 'git show-ref --count' that I worked on earlier this year [1]. At the time, it seemed useful to me (since I was working in an environment where counting the number of lines from 'show-ref' was more cumbersome than teaching 'show-ref' how to perform the same task itself). And I stand by that value judgement, but sharing the patches with the Git mailing list under the intent to merge it in was wrong. Those patches were too niche to be more generally useful, and would only serve to clutter up the UI of show-ref for everybody else. So I was glad to drop that topic. Now, I'd be curious to hear from Teng whether or not there *is* something that we're missing, since if so, I definitely want to know what it is. But absent of that, I tend to agree with Ævar that I'm not compelled by replacing 'ls-tree | grep <pattern>' with 'ls-tree --pattern=<pattern>', especially if the latter is slower than the former. Thanks, Taylor [1]: https://lore.kernel.org/git/cover.1654552560.git.me@ttaylorr.com/
Taylor Blau <me@ttaylorr.com> writes: > I think this falls into the same trap as the series on 'git show-ref > --count' that I worked on earlier this year [1]. > > At the time, it seemed useful to me (since I was working in an > environment where counting the number of lines from 'show-ref' was more > cumbersome than teaching 'show-ref' how to perform the same task > itself). > > And I stand by that value judgement, but sharing the patches with the > Git mailing list under the intent to merge it in was wrong. Those > patches were too niche to be more generally useful, and would only serve > to clutter up the UI of show-ref for everybody else. > > So I was glad to drop that topic. Now, I'd be curious to hear from Teng > whether or not there *is* something that we're missing, since if so, I > definitely want to know what it is. > > But absent of that, I tend to agree with Ævar that I'm not compelled by > replacing 'ls-tree | grep <pattern>' with 'ls-tree --pattern=<pattern>', > especially if the latter is slower than the former. > > Thanks, > Taylor > > [1]: https://lore.kernel.org/git/cover.1654552560.git.me@ttaylorr.com/ honestly, I just think it's useful to me, but omit the performance recession of the option. I originally thought about it looks like the "git tag -l <pattern>" or "git branch -l <pattern>" usage, but it seems not as a regex matching on them and it indeed executes faster than the pipe grep, because it seems like the former has the more restrictive matching conditions (because if I move the last aster, there is no output): ✗ git branch -r --list "avar*" | wc -l 1498 ✗ hyperfine 'git branch -r --list "avar*"' Benchmark 1: git branch -r --list "avar*" Time (mean ± σ): 69.8 ms ± 3.1 ms [User: 25.8 ms, System: 42.7 ms] Range (min … max): 66.6 ms … 81.8 ms 35 runs ✗ hyperfine 'git branch -r --list | grep "avar"' Benchmark 1: git branch -r --list | grep "avar" Time (mean ± σ): 76.4 ms ± 3.7 ms [User: 32.7 ms, System: 45.2 ms] Range (min … max): 72.9 ms … 85.5 ms 34 runs ➜ Documentation git:(tl/extra_bitmap_relative_path) ✗ git branch -r --list "avar" | wc -l 0 So unless someone finds other benefits, such as the implementation of "git tag -l <pattern>" to solve the performance problem, this method can bring benefits in the corresponding scenario. It's ok for me to continue or stop this patch. Thanks.
On Mon, Nov 21 2022, Teng Long wrote: > Taylor Blau <me@ttaylorr.com> writes: > >> I think this falls into the same trap as the series on 'git show-ref >> --count' that I worked on earlier this year [1]. >> >> At the time, it seemed useful to me (since I was working in an >> environment where counting the number of lines from 'show-ref' was more >> cumbersome than teaching 'show-ref' how to perform the same task >> itself). >> >> And I stand by that value judgement, but sharing the patches with the >> Git mailing list under the intent to merge it in was wrong. Those >> patches were too niche to be more generally useful, and would only serve >> to clutter up the UI of show-ref for everybody else. >> >> So I was glad to drop that topic. Now, I'd be curious to hear from Teng >> whether or not there *is* something that we're missing, since if so, I >> definitely want to know what it is. >> >> But absent of that, I tend to agree with Ævar that I'm not compelled by >> replacing 'ls-tree | grep <pattern>' with 'ls-tree --pattern=<pattern>', >> especially if the latter is slower than the former. >> >> Thanks, >> Taylor >> >> [1]: https://lore.kernel.org/git/cover.1654552560.git.me@ttaylorr.com/ > > honestly, I just think it's useful to me, but omit the performance recession of > the option. I originally thought about it looks like the "git tag -l <pattern>" > or "git branch -l <pattern>" usage, but it seems not as a regex matching on > them and it indeed executes faster than the pipe grep, because it seems like the > former has the more restrictive matching conditions (because if I move the > last aster, there is no output): > > > ✗ git branch -r --list "avar*" | wc -l > 1498 > > ✗ hyperfine 'git branch -r --list "avar*"' > Benchmark 1: git branch -r --list "avar*" > Time (mean ± σ): 69.8 ms ± 3.1 ms [User: 25.8 ms, System: 42.7 ms] > Range (min … max): 66.6 ms … 81.8 ms 35 runs > > ✗ hyperfine 'git branch -r --list | grep "avar"' > Benchmark 1: git branch -r --list | grep "avar" > Time (mean ± σ): 76.4 ms ± 3.7 ms [User: 32.7 ms, System: 45.2 ms] > Range (min … max): 72.9 ms … 85.5 ms 34 runs > > ➜ Documentation git:(tl/extra_bitmap_relative_path) ✗ git branch -r --list "avar" | wc -l > 0 Yeah, that's the built-in wildmatch() (as in wildmatch.c in-tree)-powered matching we do, and support in pretty much anything that takes a <path>. I *thought* this feature was something like that that when I first glanced at it, i.e. to regex match ls-tree entries, but e.g. a --pattern=abc would still match that in the OID, as it's just grepping the line. Which I think is one way to draw the "does this belong in git?" line. I.e. a "grep" or your "--pattern" doesn't need to know anything about sub-components of the line to be equivalent, whereas something that just matches the path does. Also, when you get into e.g. negative pathspecs and the like it becomes even more compelling. I do have some WIP patches somewhere to have our pathspecs support PCRE regexes. So I think it would be neat, or at least worth exploring. But just having ls-tree or some random command support it isn't the righ way to go about it IMO, it should be done via the pathspec API, if done at all.
From: Teng Long <dyroneteng@gmail.com> This RFC patch introduce a new "ls-tree" option "--pattern", aim to match the entries by regex then filter the output which we may want to achieve. It also contains some commit for preparation or cleanup. The idea may be not comprehensive and the tests for it might be insufficient too, but I'd like to listen the suggestion from the community to decide if it's worth going forward with. Thanks. Teng Long (6): ls-tree: cleanup the redundant SPACE t3104: remove shift code in 'test_ls_tree_format' ls-tree: optimize params of 'show_tree_common_default_long()' ls-tree: improving cohension in the print code ls-tree: introduce 'match_pattern()' function ls-tree: introduce '--pattern' option Documentation/git-ls-tree.txt | 7 ++- builtin/ls-tree.c | 109 ++++++++++++++++++++++++++-------- t/t3104-ls-tree-format.sh | 1 - t/t3106-ls-tree-pattern.sh | 70 ++++++++++++++++++++++ 4 files changed, 161 insertions(+), 26 deletions(-) create mode 100755 t/t3106-ls-tree-pattern.sh