mbox series

[RFC,0/6] ls-tree: introduce '--pattern' option

Message ID 20221117113023.65865-1-tenglong.tl@alibaba-inc.com (mailing list archive)
Headers show
Series ls-tree: introduce '--pattern' option | expand

Message

Teng Long Nov. 17, 2022, 11:30 a.m. UTC
From: Teng Long <dyroneteng@gmail.com>

This RFC patch introduce a new "ls-tree" option "--pattern", aim to match the
entries by regex then filter the output which we may want to achieve. It also
contains some commit for preparation or cleanup.

The idea may be not comprehensive and the tests for it might be insufficient
too, but I'd like to listen the suggestion from the community to decide if it's
worth going forward with.

Thanks.

Teng Long (6):
  ls-tree: cleanup the redundant SPACE
  t3104: remove shift code in 'test_ls_tree_format'
  ls-tree: optimize params of 'show_tree_common_default_long()'
  ls-tree: improving cohension in the print code
  ls-tree: introduce 'match_pattern()' function
  ls-tree: introduce '--pattern' option

 Documentation/git-ls-tree.txt |   7 ++-
 builtin/ls-tree.c             | 109 ++++++++++++++++++++++++++--------
 t/t3104-ls-tree-format.sh     |   1 -
 t/t3106-ls-tree-pattern.sh    |  70 ++++++++++++++++++++++
 4 files changed, 161 insertions(+), 26 deletions(-)
 create mode 100755 t/t3106-ls-tree-pattern.sh

Comments

Ævar Arnfjörð Bjarmason Nov. 17, 2022, 1:22 p.m. UTC | #1
On Thu, Nov 17 2022, Teng Long wrote:

> From: Teng Long <dyroneteng@gmail.com>
>
> This RFC patch introduce a new "ls-tree" option "--pattern", aim to match the
> entries by regex then filter the output which we may want to achieve. It also
> contains some commit for preparation or cleanup.
>
> The idea may be not comprehensive and the tests for it might be insufficient
> too, but I'd like to listen the suggestion from the community to decide if it's
> worth going forward with.

I applied this series, compiled with CFLAGS=-O3, and:
	
	$ hyperfine './git ls-tree --pattern=[ab]c.*d -r HEAD' './git ls-tree -r HEAD | grep [ab]c.*d' -w 10 -r 20
	Benchmark 1: ./git ls-tree --pattern=[ab]c.*d -r HEAD
	  Time (mean ± σ):      14.8 ms ±   0.1 ms    [User: 12.0 ms, System: 2.8 ms]
	  Range (min … max):    14.6 ms …  15.0 ms    20 runs
	
	Benchmark 2: ./git ls-tree -r HEAD | grep [ab]c.*d
	  Time (mean ± σ):      12.5 ms ±   0.1 ms    [User: 10.0 ms, System: 4.0 ms]
	  Range (min … max):    12.4 ms …  12.8 ms    20 runs
	
	Summary
	  './git ls-tree -r HEAD | grep [ab]c.*d' ran
	    1.18 ± 0.01 times faster than './git ls-tree --pattern=[ab]c.*d -r HEAD'

So the value-proposition isn't really clear to me, and the included
docs, commit messages & this CL don't answer the "why not just 'grep'"
question?

That's faster even with another process for me, but likely that's
because you're doing the regex matching really inefficiently
(e.g. malloc-ing again for each line), which could be "fixed".

But in any setup which cares about the performance you're likely piping
to another process anyway (the thing using the data), which could do
that filtering without thep "grep" process.

So I don't see the value in doing this, but maybe I'm just missing
something.

And, in terms of the complexity for git's implementation it would be
really good to avoid the complexity of a "--pattern", "--sort-lines"
etc., if those use-cases can be satisfied by piping into "grep" or
"sort" instead.

Some of the pre-cleanup here looks good, but it's unrelated to the rest
of the series. I think in any case that it would be nice to see that as
another topic.
Taylor Blau Nov. 17, 2022, 10:02 p.m. UTC | #2
On Thu, Nov 17, 2022 at 02:22:20PM +0100, Ævar Arnfjörð Bjarmason wrote:
>
> On Thu, Nov 17 2022, Teng Long wrote:
>
> > From: Teng Long <dyroneteng@gmail.com>
> >
> > This RFC patch introduce a new "ls-tree" option "--pattern", aim to match the
> > entries by regex then filter the output which we may want to achieve. It also
> > contains some commit for preparation or cleanup.
> >
> > The idea may be not comprehensive and the tests for it might be insufficient
> > too, but I'd like to listen the suggestion from the community to decide if it's
> > worth going forward with.
>
> I applied this series, compiled with CFLAGS=-O3, and:
>
> 	$ hyperfine './git ls-tree --pattern=[ab]c.*d -r HEAD' './git ls-tree -r HEAD | grep [ab]c.*d' -w 10 -r 20
> 	Benchmark 1: ./git ls-tree --pattern=[ab]c.*d -r HEAD
> 	  Time (mean ± σ):      14.8 ms ±   0.1 ms    [User: 12.0 ms, System: 2.8 ms]
> 	  Range (min … max):    14.6 ms …  15.0 ms    20 runs
>
> 	Benchmark 2: ./git ls-tree -r HEAD | grep [ab]c.*d
> 	  Time (mean ± σ):      12.5 ms ±   0.1 ms    [User: 10.0 ms, System: 4.0 ms]
> 	  Range (min … max):    12.4 ms …  12.8 ms    20 runs
>
> 	Summary
> 	  './git ls-tree -r HEAD | grep [ab]c.*d' ran
> 	    1.18 ± 0.01 times faster than './git ls-tree --pattern=[ab]c.*d -r HEAD'
>
> So the value-proposition isn't really clear to me, and the included
> docs, commit messages & this CL don't answer the "why not just 'grep'"
> question?
>
> That's faster even with another process for me, but likely that's
> because you're doing the regex matching really inefficiently
> (e.g. malloc-ing again for each line), which could be "fixed".
>
> But in any setup which cares about the performance you're likely piping
> to another process anyway (the thing using the data), which could do
> that filtering without thep "grep" process.
>
> So I don't see the value in doing this, but maybe I'm just missing
> something.

I think this falls into the same trap as the series on 'git show-ref
--count' that I worked on earlier this year [1].

At the time, it seemed useful to me (since I was working in an
environment where counting the number of lines from 'show-ref' was more
cumbersome than teaching 'show-ref' how to perform the same task
itself).

And I stand by that value judgement, but sharing the patches with the
Git mailing list under the intent to merge it in was wrong. Those
patches were too niche to be more generally useful, and would only serve
to clutter up the UI of show-ref for everybody else.

So I was glad to drop that topic. Now, I'd be curious to hear from Teng
whether or not there *is* something that we're missing, since if so, I
definitely want to know what it is.

But absent of that, I tend to agree with Ævar that I'm not compelled by
replacing 'ls-tree | grep <pattern>' with 'ls-tree --pattern=<pattern>',
especially if the latter is slower than the former.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/cover.1654552560.git.me@ttaylorr.com/
Teng Long Nov. 21, 2022, 11:41 a.m. UTC | #3
Taylor Blau <me@ttaylorr.com> writes:

> I think this falls into the same trap as the series on 'git show-ref
> --count' that I worked on earlier this year [1].
>
> At the time, it seemed useful to me (since I was working in an
> environment where counting the number of lines from 'show-ref' was more
> cumbersome than teaching 'show-ref' how to perform the same task
> itself).
>
> And I stand by that value judgement, but sharing the patches with the
> Git mailing list under the intent to merge it in was wrong. Those
> patches were too niche to be more generally useful, and would only serve
> to clutter up the UI of show-ref for everybody else.
>
> So I was glad to drop that topic. Now, I'd be curious to hear from Teng
> whether or not there *is* something that we're missing, since if so, I
> definitely want to know what it is.
>
> But absent of that, I tend to agree with Ævar that I'm not compelled by
> replacing 'ls-tree | grep <pattern>' with 'ls-tree --pattern=<pattern>',
> especially if the latter is slower than the former.
>
> Thanks,
> Taylor
>
> [1]: https://lore.kernel.org/git/cover.1654552560.git.me@ttaylorr.com/

honestly, I just think it's useful to me, but omit the performance recession of
the option. I originally thought about it looks like the "git tag -l <pattern>"
or "git branch -l <pattern>" usage, but it seems not as a regex matching on
them and it indeed executes faster than the pipe grep, because it seems like the
former has the more restrictive matching conditions (because if I move the
last aster, there is no output):


✗ git branch -r --list "avar*" | wc -l
    1498

✗ hyperfine 'git branch -r --list "avar*"'
Benchmark 1: git branch -r --list "avar*"
  Time (mean ± σ):      69.8 ms ±   3.1 ms    [User: 25.8 ms, System: 42.7 ms]
  Range (min … max):    66.6 ms …  81.8 ms    35 runs

✗ hyperfine 'git branch -r --list | grep "avar"'
Benchmark 1: git branch -r --list | grep "avar"
  Time (mean ± σ):      76.4 ms ±   3.7 ms    [User: 32.7 ms, System: 45.2 ms]
  Range (min … max):    72.9 ms …  85.5 ms    34 runs

➜  Documentation git:(tl/extra_bitmap_relative_path) ✗ git branch -r --list "avar" | wc -l
       0

So unless someone finds other benefits, such as the implementation of "git tag
-l <pattern>" to solve the performance problem, this method can bring benefits
in the corresponding scenario. It's ok for me to continue or stop this patch.

Thanks.
Ævar Arnfjörð Bjarmason Nov. 21, 2022, 12:12 p.m. UTC | #4
On Mon, Nov 21 2022, Teng Long wrote:

> Taylor Blau <me@ttaylorr.com> writes:
>
>> I think this falls into the same trap as the series on 'git show-ref
>> --count' that I worked on earlier this year [1].
>>
>> At the time, it seemed useful to me (since I was working in an
>> environment where counting the number of lines from 'show-ref' was more
>> cumbersome than teaching 'show-ref' how to perform the same task
>> itself).
>>
>> And I stand by that value judgement, but sharing the patches with the
>> Git mailing list under the intent to merge it in was wrong. Those
>> patches were too niche to be more generally useful, and would only serve
>> to clutter up the UI of show-ref for everybody else.
>>
>> So I was glad to drop that topic. Now, I'd be curious to hear from Teng
>> whether or not there *is* something that we're missing, since if so, I
>> definitely want to know what it is.
>>
>> But absent of that, I tend to agree with Ævar that I'm not compelled by
>> replacing 'ls-tree | grep <pattern>' with 'ls-tree --pattern=<pattern>',
>> especially if the latter is slower than the former.
>>
>> Thanks,
>> Taylor
>>
>> [1]: https://lore.kernel.org/git/cover.1654552560.git.me@ttaylorr.com/
>
> honestly, I just think it's useful to me, but omit the performance recession of
> the option. I originally thought about it looks like the "git tag -l <pattern>"
> or "git branch -l <pattern>" usage, but it seems not as a regex matching on
> them and it indeed executes faster than the pipe grep, because it seems like the
> former has the more restrictive matching conditions (because if I move the
> last aster, there is no output):
>
>
> ✗ git branch -r --list "avar*" | wc -l
>     1498
>
> ✗ hyperfine 'git branch -r --list "avar*"'
> Benchmark 1: git branch -r --list "avar*"
>   Time (mean ± σ):      69.8 ms ±   3.1 ms    [User: 25.8 ms, System: 42.7 ms]
>   Range (min … max):    66.6 ms …  81.8 ms    35 runs
>
> ✗ hyperfine 'git branch -r --list | grep "avar"'
> Benchmark 1: git branch -r --list | grep "avar"
>   Time (mean ± σ):      76.4 ms ±   3.7 ms    [User: 32.7 ms, System: 45.2 ms]
>   Range (min … max):    72.9 ms …  85.5 ms    34 runs
>
> ➜  Documentation git:(tl/extra_bitmap_relative_path) ✗ git branch -r --list "avar" | wc -l
>        0

Yeah, that's the built-in wildmatch() (as in wildmatch.c
in-tree)-powered matching we do, and support in pretty much anything
that takes a <path>.

I *thought* this feature was something like that that when I first
glanced at it, i.e. to regex match ls-tree entries, but e.g. a
--pattern=abc would still match that in the OID, as it's just grepping
the line.

Which I think is one way to draw the "does this belong in git?"
line. I.e. a "grep" or your "--pattern" doesn't need to know anything
about sub-components of the line to be equivalent, whereas something
that just matches the path does.

Also, when you get into e.g. negative pathspecs and the like it becomes
even more compelling.

I do have some WIP patches somewhere to have our pathspecs support PCRE
regexes.

So I think it would be neat, or at least worth exploring. But just
having ls-tree or some random command support it isn't the righ way to
go about it IMO, it should be done via the pathspec API, if done at all.