mbox series

[v4,00/29] tree-walk: mostly replace "mode" with "enum object_type"

Message ID cover.1616282533.git.avarab@gmail.com (mailing list archive)
Headers show
Series tree-walk: mostly replace "mode" with "enum object_type" | expand

Message

Ævar Arnfjörð Bjarmason March 21, 2021, midnight UTC
A re-roll of v3 of this series[1] and on top of (and requires) my
just-submitted v5 re-roll of the read_tree() refactoring series[2].

There was a regression in 1/32 of the old series. Removing the
canon_mode() call in diff.c didn't account for us needing to
canonicalize "diff --no-index" modes. There were no tests for this,
and it failed or not depending on the FS modes in the git.git checkout
being tested. This fixes the CI smoke coming from this series.

I plan to submit another set of patches to fix git diff --no-index's
test coverage, but for this series it was best to just eject it.

I'm confident that this is in a good shape now. There is now no
canon_mode() change in this series, just refactoring of
s/mode/object_type/g.

As discussed in earlier rounds the point of this prep series is to get
to a point where we can expose an API to fix long-standin blindspots
in fsck checks.

This is a long journey to get to that point, and isn't strictly
necessary to fix tht issue. But I think it's worth it since it leaves
the API use in a much better state where it's clear which things
actually require the mode bits, and which are just asking about the
object type of tree entries.

I dropped a couple of more patches to do with archive.c, in my
re-rolled [2] I'm now dealing with the "stage" parameter there, so
there's no need to do that here anymore.

1. https://lore.kernel.org/git/20210316155829.31242-1-avarab@gmail.com/
2. https://lore.kernel.org/git/xmqqy2ehctjo.fsf@gitster.g

Ævar Arnfjörð Bjarmason (29):
  notes & match-trees: use name_entry's "pathlen" member
  cache.h: add a comment to object_type()
  tree-walk.h: add object_type member to name_entry
  tree-walk.c: migrate to using new "object_type" field when possible
  fast-import tests: test for sorting dir/file foo v.s. foo.txt
  mktree tests: test that "mode" is passed when sorting
  diff tests: test that "mode" is passed when sorting
  cache.h: have base_name_compare() take "is tree?", not "mode"
  tree-walk.h users: switch object_type(...) to new .object_type
  tree.h: format argument lists of read_tree_recursive() users
  tree.h API: make read_tree_fn_t take an "enum object_type"
  tree-walk.h users: migrate "p->mode &&" pattern
  tree-walk.h users: refactor chained "mode" if/else into switch
  tree-walk.h users: migrate miscellaneous "mode" to "object_type"
  merge-tree tests: test for the mode comparison in same_entry()
  merge-ort: correct reference to test in 62fdec17a11
  fsck.c: switch on "object_type" in fsck_walk_tree()
  tree-walk.h users: use temporary variable(s) for "mode"
  tree-walk.h API: formatting changes for subsequent commit
  tree-walk.h API: rename get_tree_entry() to get_tree_entry_mode()
  match-trees: use "tmp" for mode in shift_tree_by()
  tree-walk.h API: add get_tree_entry_type()
  tree-walk.h API: document and format tree_entry_extract()
  tree-entry.h API: rename tree_entry_extract() to
    tree_entry_extract_mode()
  tree-walk.h API: add a tree_entry_extract_all() function
  tree-walk.h API: add get_tree_entry_all()
  tree-walk.h API: add a get_tree_entry_path() function
  blame: emit a better error on 'git blame directory'
  tree-walk.h API: add a tree_entry_extract_type() function

 archive.c                       | 29 ++++++-----
 blame.c                         |  9 ++--
 builtin/checkout.c              |  6 ++-
 builtin/fast-import.c           | 12 +++--
 builtin/grep.c                  |  6 +--
 builtin/log.c                   |  7 +--
 builtin/ls-files.c              |  6 ++-
 builtin/ls-tree.c               | 14 +++---
 builtin/merge-tree.c            | 30 +++++++----
 builtin/mktree.c                |  4 +-
 builtin/pack-objects.c          |  6 +--
 builtin/reflog.c                |  3 +-
 builtin/rm.c                    |  2 +-
 builtin/update-index.c          |  6 ++-
 cache-tree.c                    |  2 +-
 cache.h                         | 11 ++--
 combine-diff.c                  |  8 +--
 delta-islands.c                 |  2 +-
 fsck.c                          | 23 ++++-----
 http-push.c                     |  6 ++-
 line-log.c                      |  2 +-
 list-objects.c                  | 20 +++++---
 match-trees.c                   | 52 +++++++++----------
 merge-ort.c                     | 13 ++---
 merge-recursive.c               | 33 ++++++------
 notes.c                         | 14 +++---
 object-name.c                   |  7 ++-
 pack-bitmap-write.c             |  8 +--
 read-cache.c                    | 16 +++---
 revision.c                      | 12 +++--
 t/t1450-fsck.sh                 | 66 ++++++++++++++++++++++++
 t/t4300-merge-tree.sh           | 44 ++++++++++++++++
 t/t8004-blame-with-conflicts.sh | 21 ++++++++
 t/t9300-fast-import.sh          | 87 ++++++++++++++++++++++++++++++++
 tree-diff.c                     | 30 +++++++----
 tree-walk.c                     | 89 ++++++++++++++++++++++++---------
 tree-walk.h                     | 63 ++++++++++++++++++++---
 tree.c                          | 19 ++++---
 tree.h                          |  5 +-
 unpack-trees.c                  | 24 +++++----
 walker.c                        | 22 ++++----
 41 files changed, 606 insertions(+), 233 deletions(-)

Range-diff:
 1:  26bc38fbdd0 <  -:  ----------- diff.c: remove redundant canon_mode() call
 2:  8502cf8134b =  1:  0390b1e9ded notes & match-trees: use name_entry's "pathlen" member
 3:  68133fa6aaa =  2:  186f7a7a44b cache.h: add a comment to object_type()
 4:  9f714d6c01f =  3:  4fed508f3cb tree-walk.h: add object_type member to name_entry
 5:  6dbf2b0a6aa =  4:  c557b67231b tree-walk.c: migrate to using new "object_type" field when possible
 6:  354a8e9a2a1 =  5:  7016008554a fast-import tests: test for sorting dir/file foo v.s. foo.txt
 7:  e2331df28e1 =  6:  b0546197b1b mktree tests: test that "mode" is passed when sorting
 8:  9e9486c2ea0 =  7:  73e92ac187d diff tests: test that "mode" is passed when sorting
 9:  be5c713336a =  8:  1b6a10f814c cache.h: have base_name_compare() take "is tree?", not "mode"
10:  43623edddfb =  9:  f8912a409b5 tree-walk.h users: switch object_type(...) to new .object_type
11:  030898f884c ! 10:  df43f1a76ac tree.h: format argument lists of read_tree_recursive() users
    @@ archive.c: static int check_attr_export_subst(const struct attr_check *check)
      }
      
      static int write_archive_entry(const struct object_id *oid, const char *base,
    --		int baselen, const char *filename, unsigned mode, int stage,
    +-		int baselen, const char *filename, unsigned mode,
     -		void *context)
     +			       int baselen, const char *filename,
     +			       unsigned mode,
    -+			       int stage,
     +			       void *context)
      {
      	static struct strbuf path = STRBUF_INIT;
    @@ archive.c: static int write_directory(struct archiver_context *c)
     +					void *context)
      {
      	struct archiver_context *c = context;
    - 	int stage = 0;
    + 
     @@ archive.c: struct path_exists_context {
      };
      
    @@ tree.h: struct tree *parse_tree_indirect(const struct object_id *oid);
     +			      void *);
      
      int read_tree_at(struct repository *r,
    - 		 struct tree *tree,
    + 		 struct tree *tree, struct strbuf *base,
12:  0ce197950b0 <  -:  ----------- tree.h users: format argument lists in archive.c
13:  051c9f32ac6 <  -:  ----------- archive: get rid of 'stage' parameter
14:  bc73994b4c7 = 11:  152c6d88542 tree.h API: make read_tree_fn_t take an "enum object_type"
15:  e2b0964228f = 12:  1c2f65cb674 tree-walk.h users: migrate "p->mode &&" pattern
16:  29dbc4292e0 = 13:  163922d427c tree-walk.h users: refactor chained "mode" if/else into switch
17:  ada6d051769 = 14:  21df7c668be tree-walk.h users: migrate miscellaneous "mode" to "object_type"
18:  01860daa556 = 15:  592fecb7abc merge-tree tests: test for the mode comparison in same_entry()
19:  e4be48fb50f = 16:  092472f3c8d merge-ort: correct reference to test in 62fdec17a11
20:  189d2550fb8 = 17:  685da1abbdc fsck.c: switch on "object_type" in fsck_walk_tree()
21:  3b7b12e6f79 = 18:  4babecb7855 tree-walk.h users: use temporary variable(s) for "mode"
22:  6c3a07a3279 = 19:  7251654dd52 tree-walk.h API: formatting changes for subsequent commit
23:  879eb3da2a4 = 20:  9e521a3cbf2 tree-walk.h API: rename get_tree_entry() to get_tree_entry_mode()
24:  d2fa360ab9e = 21:  40f37e99cd9 match-trees: use "tmp" for mode in shift_tree_by()
25:  cc50cfcf517 = 22:  7f699bf2d5c tree-walk.h API: add get_tree_entry_type()
26:  f642a35482b = 23:  7ab7576cb16 tree-walk.h API: document and format tree_entry_extract()
27:  a0bcb59fa57 = 24:  a1bd81d64aa tree-entry.h API: rename tree_entry_extract() to tree_entry_extract_mode()
28:  c7d4ba77340 = 25:  ce7c19ad39c tree-walk.h API: add a tree_entry_extract_all() function
29:  1d5421d67af = 26:  95eec961be8 tree-walk.h API: add get_tree_entry_all()
30:  a3e1063ac45 = 27:  a2a34fd3e2d tree-walk.h API: add a get_tree_entry_path() function
31:  da3dd6dd532 = 28:  81da5490221 blame: emit a better error on 'git blame directory'
32:  80e5cb0b30c = 29:  4d51da4ea39 tree-walk.h API: add a tree_entry_extract_type() function

Comments

Junio C Hamano March 21, 2021, 1:16 a.m. UTC | #1
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> A re-roll of v3 of this series[1] and on top of (and requires) my
> just-submitted v5 re-roll of the read_tree() refactoring series[2].
>
> There was a regression in 1/32 of the old series. Removing the
> canon_mode() call in diff.c didn't account for us needing to
> canonicalize "diff --no-index" modes. There were no tests for this,
> and it failed or not depending on the FS modes in the git.git checkout
> being tested. This fixes the CI smoke coming from this series.

Sorry, but quite honestly, I am not quite sure what value this
entire code churn is trying to add to the codebase.

The function signature of read_tree_fn_t callback function gets
changed from the mode bits (which is capable to express differences
between regular files, executable files and symbolic links) to "enum
object_type" (which can only say "this is a blob"), which is a
regression, no?  

A callback can no longer do things like this, for example:

static int add_path_to_index(const struct object_id *oid,
			     struct strbuf *base, const char *path,
			     unsigned int mode, void *context)
{
	struct index_state *istate = (struct index_state *)context;
	struct cache_entry *ce;
	size_t len = base->len;

	if (S_ISDIR(mode))
		return READ_TREE_RECURSIVE;

	strbuf_addstr(base, path);

	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
	ce->ce_flags |= CE_SKIP_WORKTREE;
	set_index_entry(istate, istate->cache_nr++, ce);

	strbuf_setlen(base, len);
	return 0;
}

where executableness or symlinkshood is lost.

This probably is the third time I caught similar "let's lose
information passed through the call chain as nobody seems to need
it" mistakes in the iterations of this series, and that is two times
too many.  We should learn from our earlier mistakes---tweaking of
the API that happens to be still OK with the current codebase can be
either a needless churn that loses useful expressiveness from the
API, or a useful clean-up to kill dead parameter or excess
flexibility.

And these three incidents we have seen so far are the former.

Thanks.
Ævar Arnfjörð Bjarmason March 21, 2021, 12:26 p.m. UTC | #2
On Sun, Mar 21 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> A re-roll of v3 of this series[1] and on top of (and requires) my
>> just-submitted v5 re-roll of the read_tree() refactoring series[2].
>>
>> There was a regression in 1/32 of the old series. Removing the
>> canon_mode() call in diff.c didn't account for us needing to
>> canonicalize "diff --no-index" modes. There were no tests for this,
>> and it failed or not depending on the FS modes in the git.git checkout
>> being tested. This fixes the CI smoke coming from this series.
>
> Sorry, but quite honestly, I am not quite sure what value this
> entire code churn is trying to add to the codebase.
>
> The function signature of read_tree_fn_t callback function gets
> changed from the mode bits (which is capable to express differences
> between regular files, executable files and symbolic links) to "enum
> object_type" (which can only say "this is a blob"), which is a
> regression, no?  
>
> A callback can no longer do things like this, for example:
>
> static int add_path_to_index(const struct object_id *oid,
> 			     struct strbuf *base, const char *path,
> 			     unsigned int mode, void *context)
> {
> 	struct index_state *istate = (struct index_state *)context;
> 	struct cache_entry *ce;
> 	size_t len = base->len;
>
> 	if (S_ISDIR(mode))
> 		return READ_TREE_RECURSIVE;
>
> 	strbuf_addstr(base, path);
>
> 	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
> 	ce->ce_flags |= CE_SKIP_WORKTREE;
> 	set_index_entry(istate, istate->cache_nr++, ce);
>
> 	strbuf_setlen(base, len);
> 	return 0;
> }
>
> where executableness or symlinkshood is lost.

Yes, that would be a serious regression. I agree that all these
functions/callbacks etc. should have a way to get at the mode bits.

I'm adding "enum object_type", not removing the "mode" parameter in
read_tree_fn_t. This function (which is in "seen" as 03316f20347
(sparse-index: implement ensure_full_index(), 2021-03-16)) works just
fine in combination with this series.

The other APIs modified here all retain the ability to give you the mode
bit, they (the tree-walk.h changes) just optionally give you the option
of getting just the type (or just the path), and as it turns out most
users of the API can be converted over to that.

> This probably is the third time I caught similar "let's lose
> information passed through the call chain as nobody seems to need
> it" mistakes in the iterations of this series, and that is two times
> too many.  We should learn from our earlier mistakes---tweaking of
> the API that happens to be still OK with the current codebase can be
> either a needless churn that loses useful expressiveness from the
> API, or a useful clean-up to kill dead parameter or excess
> flexibility.
>
> And these three incidents we have seen so far are the former.

The current codebase will allow you to stick arbitrary mode bits in
trees, we have an fsck check to prevent that which doesn't work. I had a
summary of this in v1, but should probably have provided a recap[1].

This series is an admittedly long journey towards fixing that. I've got
unsubmitted patchen on top that make that fsck check work again.

I think the root cause of these bugs and other ones I've found along the
way (some of which I'm not quite comfortable discussing the details of
on the public list yet) is that the tree walking API is unnecessarily
low-level for most callers.

Most of those callers don't care about the details of the mode bits, but
are just traversing a tree and doing something with one the object
types.

As opposed to having a mode, but do you want a mode as-is from a tree,
normalized to canon_mode() (for writing?) etc. I think being able to
clearly tell apart those callers from the simpler ones is a clear win.

So I'm hoping you'll bear with me & this series, sorry about the
breakages so far, in my slight defense they were all subtle testing
blind spots (but we now have tests!).

1. https://lore.kernel.org/git/20210308150650.18626-1-avarab@gmail.com/
Junio C Hamano March 21, 2021, 5:13 p.m. UTC | #3
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> Yes, that would be a serious regression. I agree that all these
> functions/callbacks etc. should have a way to get at the mode bits.
>
> I'm adding "enum object_type", not removing the "mode" parameter in
> read_tree_fn_t. This function (which is in "seen" as 03316f20347
> (sparse-index: implement ensure_full_index(), 2021-03-16)) works just
> fine in combination with this series.
>
> The other APIs modified here all retain the ability to give you the mode
> bit, they (the tree-walk.h changes) just optionally give you the option
> of getting just the type (or just the path), and as it turns out most
> users of the API can be converted over to that.

I have a vague feeling that such an approach may be still repeating
the same mistake.

If the original premise is that "unsigned mode" bit can be abused to
feed impossible values like 0100653 to the system and the code
should catch it, ...

> The current codebase will allow you to stick arbitrary mode bits in
> trees, ...

... then I would understand it if the approach is to introduce a
distinct type that enumerates all possible mode bit values (and
nothing else), and have the compiler validate the callchain, so that
nobody can pass bogus mode bits (and when reading mode bits fields
in existing objects, the one that converts from the series of octal
bytes to that distsinct type would notice and complain).  And we've
already seen from this particular breakages that the "distinct type"
appropriate to be used for that purpose is not "enum object_type".
It is not expressive enough to enumerate all possible mode bit
values (besides, an enum is interchangeable with an int, so there
isn't much protection we would be getting from the compiler---we
could use a small struct of a new type, and have one static const
instance for each possible mode bit combination, I guess, but the
point here is that insisting on using "enum object_type" seems to be
the source of the problem).

I am afraid that it is even worse to pass both object type and
"unsigned mode" together.  It would still leave room for a bug to
pass nonsense mode bits, which we said we wanted to catch in our
original mission statement.  In addition, we now have a new room for
a bug, which is to pass an inconsistent pair of object type and mode
to the callchain.  Somebody would need to say "yuck, we got a mode
100644 but the type says TREE" now, in addition to validating if the
mode is sensible, which we should be doing somehow, no?

So, I am not sure how these changes are making anything better.

> ... I had a
> summary of this in v1, but should probably have provided a recap[1].

Oh, absolutely.  When we iterate, we should be welcoming to those
who missed earlier iterations.
Ævar Arnfjörð Bjarmason March 21, 2021, 6:42 p.m. UTC | #4
On Sun, Mar 21 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> Yes, that would be a serious regression. I agree that all these
>> functions/callbacks etc. should have a way to get at the mode bits.
>>
>> I'm adding "enum object_type", not removing the "mode" parameter in
>> read_tree_fn_t. This function (which is in "seen" as 03316f20347
>> (sparse-index: implement ensure_full_index(), 2021-03-16)) works just
>> fine in combination with this series.
>>
>> The other APIs modified here all retain the ability to give you the mode
>> bit, they (the tree-walk.h changes) just optionally give you the option
>> of getting just the type (or just the path), and as it turns out most
>> users of the API can be converted over to that.
>
> I have a vague feeling that such an approach may be still repeating
> the same mistake.
>
> If the original premise is that "unsigned mode" bit can be abused to
> feed impossible values like 0100653 to the system and the code
> should catch it, ...
>
>> The current codebase will allow you to stick arbitrary mode bits in
>> trees, ...
>
> ... then I would understand it if the approach is to introduce a
> distinct type that enumerates all possible mode bit values (and
> nothing else), and have the compiler validate the callchain, so that
> nobody can pass bogus mode bits (and when reading mode bits fields
> in existing objects, the one that converts from the series of octal
> bytes to that distsinct type would notice and complain).  And we've
> already seen from this particular breakages that the "distinct type"
> appropriate to be used for that purpose is not "enum object_type".
> It is not expressive enough to enumerate all possible mode bit
> values (besides, an enum is interchangeable with an int, so there
> isn't much protection we would be getting from the compiler---we
> could use a small struct of a new type, and have one static const
> instance for each possible mode bit combination, I guess, but the
> point here is that insisting on using "enum object_type" seems to be
> the source of the problem).

Yes, this is a good suggestion, but one which'll be much easier to do
after this series. Since the majority of callers don't care about the
raw mode bits or anything except if the entry is a blob/tree/commit.

> I am afraid that it is even worse to pass both object type and
> "unsigned mode" together.  It would still leave room for a bug to
> pass nonsense mode bits, which we said we wanted to catch in our
> original mission statement.  In addition, we now have a new room for
> a bug, which is to pass an inconsistent pair of object type and mode
> to the callchain.  Somebody would need to say "yuck, we got a mode
> 100644 but the type says TREE" now, in addition to validating if the
> mode is sensible, which we should be doing somehow, no?

Indeed. This goes back to my "bear with me" comment upthread. I'm
planning to fix these cases too, and starting by moving most things to
"enum object_type" makes that a lot easier.

Right now we simply trust the mode bits, but that opens the door to the
same sort of bug I'm fixing in another series (that I need to re-roll)
where we have a "type commit" and "object ID" in a tag, but the ID is
really a tree or whatever.

We then get confused because we trusted the invalid metadata over
checking the type of the object ID we have in the object store.

So this particular series doesn't fix this bug at the end of it, the
"object_type" in tree-walk.c is just a function of deriving it from the
mode bits.

But once I've audited the cases where we're really just acting on the
type and don't have or use the mode (so we know we're not re-writing the
mode somewhere) then we would set the object_type to the actual type, as
in what we'd get from oid_object_info(ID).

Then we can e.g. scream murder about a mode/type mismatch in fsck, but
still be able to list/diff/inspect etc. that object in any codepath that
just wants to e.g. recursively iterate over the tree.



> So, I am not sure how these changes are making anything better.
>
>> ... I had a
>> summary of this in v1, but should probably have provided a recap[1].
>
> Oh, absolutely.  When we iterate, we should be welcoming to those
> who missed earlier iterations.