diff mbox series

[v5,10/14] diff-lib: handle index diffs with sparse dirs

Message ID b9b97e0112939d1787ff1d2a13c48e5b406408db.1623069252.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Sparse-index: integrate with status | expand

Commit Message

Derrick Stolee June 7, 2021, 12:34 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

While comparing an index to a tree, we may see a sparse directory entry.
In this case, we should compare that portion of the tree to the tree
represented by that entry. This could include a new tree which needs to
be expanded to a full list of added files. It could also include an
existing tree, in which case all of the changes inside are important to
describe, including the modifications, additions, and deletions. Note
that the case where the tree has a path and the index does not remains
identical to before: the lack of a cache entry is the same with a sparse
index.

In the case where a tree is modified, we need to expand the tree
recursively, and start comparing each contained entry as either an
addition, deletion, or modification. This causes an interesting
recursion that did not exist before.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 diff-lib.c | 188 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 188 insertions(+)

Comments

Derrick Stolee June 7, 2021, 3:26 p.m. UTC | #1
On 6/7/2021 8:34 AM, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
...
> +			old_entry = make_transient_cache_entry(
> +					entry[0].mode, &entry[0].oid,
> +					old_path, /* stage */ 0);

I didn't realize this before I started integrating with
v2.32.0 (which I should have done before submitting v5) that
make_transient_cache_entry() has changed its prototype to
include a memory pool parameter.

I'm working on a v6 that makes only this update and it will
probably be ready tomorrow.

Thanks,
-Stolee
Junio C Hamano June 8, 2021, 1:05 a.m. UTC | #2
Derrick Stolee <stolee@gmail.com> writes:

> On 6/7/2021 8:34 AM, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
> ...
>> +			old_entry = make_transient_cache_entry(
>> +					entry[0].mode, &entry[0].oid,
>> +					old_path, /* stage */ 0);
>
> I didn't realize this before I started integrating with
> v2.32.0 (which I should have done before submitting v5) that
> make_transient_cache_entry() has changed its prototype to
> include a memory pool parameter.

Sorry for the trouble---these are usually all known to me for topics
I happened to have picked up in 'seen', since I try to make it a rule
that 'seen' must be a descendant of 'master'.

How can I usefully communicate the conflicts I find out during the
integration cycles to topic owners, I wonder.

Thanks.
Derrick Stolee June 8, 2021, 1 p.m. UTC | #3
On 6/7/2021 9:05 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 6/7/2021 8:34 AM, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>> ...
>>> +			old_entry = make_transient_cache_entry(
>>> +					entry[0].mode, &entry[0].oid,
>>> +					old_path, /* stage */ 0);
>>
>> I didn't realize this before I started integrating with
>> v2.32.0 (which I should have done before submitting v5) that
>> make_transient_cache_entry() has changed its prototype to
>> include a memory pool parameter.
> 
> Sorry for the trouble---these are usually all known to me for topics
> I happened to have picked up in 'seen', since I try to make it a rule
> that 'seen' must be a descendant of 'master'.
> 
> How can I usefully communicate the conflicts I find out during the
> integration cycles to topic owners, I wonder.

This is my fault for stacking topics. I used a GitGitGadget PR
to target a custom merge of other topics in flight, so my
merges were testing against a static target. When those topics
were merged, I should have updated my PR to point to 'master' or
even 'next'.

Thanks,
-Stolee
Elijah Newren June 9, 2021, 5:47 a.m. UTC | #4
On Mon, Jun 7, 2021 at 5:34 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> While comparing an index to a tree, we may see a sparse directory entry.
> In this case, we should compare that portion of the tree to the tree
> represented by that entry. This could include a new tree which needs to
> be expanded to a full list of added files. It could also include an
> existing tree, in which case all of the changes inside are important to
> describe, including the modifications, additions, and deletions. Note
> that the case where the tree has a path and the index does not remains
> identical to before: the lack of a cache entry is the same with a sparse
> index.
>
> In the case where a tree is modified, we need to expand the tree
> recursively, and start comparing each contained entry as either an
> addition, deletion, or modification. This causes an interesting
> recursion that did not exist before.

So, I haven't read through this in detail yet...but there's a big
question I'm curious about:

Git already has code for comparing an index to a tree, a tree to a
tree, or a tree to the working directory, right?  So, when comparing a
sparse-index to a tree...can't we re-use the compare a tree to a tree
code when we hit a sparse directory?

Maybe there's a really good reason to conceptually duplicate the
compare a tree to a tree code, but it seems the commit message should
at least address that reason and why we need to reimplement that
logic.


> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  diff-lib.c | 188 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 188 insertions(+)
>
> diff --git a/diff-lib.c b/diff-lib.c
> index b73cc1859a49..ba4c683d4bc4 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -314,6 +314,48 @@ static int get_stat_data(const struct cache_entry *ce,
>         return 0;
>  }
>
> +struct show_new_tree_context {
> +       struct rev_info *revs;
> +       unsigned added:1;
> +};
> +
> +static int show_new_file_from_tree(const struct object_id *oid,
> +                                  struct strbuf *base, const char *path,
> +                                  unsigned int mode, void *context)
> +{
> +       struct show_new_tree_context *ctx = context;
> +       struct cache_entry *new_file = make_transient_cache_entry(mode, oid, path, /* stage */ 0);
> +
> +       diff_index_show_file(ctx->revs, ctx->added ? "+" : "-", new_file, oid, !is_null_oid(oid), mode, 0);
> +       discard_cache_entry(new_file);
> +       return 0;
> +}
> +
> +static void show_directory(struct rev_info *revs,
> +                          const struct cache_entry *new_dir,
> +                          int added)
> +{
> +       /*
> +        * new_dir is a sparse directory entry, so we want to collect all
> +        * of the new files within the tree. This requires recursively
> +        * expanding the trees.
> +        */
> +       struct show_new_tree_context ctx = { revs, added };
> +       struct repository *r = revs->repo;
> +       struct strbuf base = STRBUF_INIT;
> +       struct pathspec ps;
> +       struct tree *tree = lookup_tree(r, &new_dir->oid);
> +
> +       memset(&ps, 0, sizeof(ps));
> +       ps.recursive = 1;
> +       ps.has_wildcard = 1;
> +       ps.max_depth = -1;
> +
> +       strbuf_add(&base, new_dir->name, new_dir->ce_namelen);
> +       read_tree_at(r, tree, &base, &ps,
> +                       show_new_file_from_tree, &ctx);
> +}
> +
>  static void show_new_file(struct rev_info *revs,
>                           const struct cache_entry *new_file,
>                           int cached, int match_missing)
> @@ -322,6 +364,11 @@ static void show_new_file(struct rev_info *revs,
>         unsigned int mode;
>         unsigned dirty_submodule = 0;
>
> +       if (new_file && S_ISSPARSEDIR(new_file->ce_mode)) {
> +               show_directory(revs, new_file, /*added */ 1);
> +               return;
> +       }
> +
>         /*
>          * New file in the index: it might actually be different in
>          * the working tree.
> @@ -333,6 +380,136 @@ static void show_new_file(struct rev_info *revs,
>         diff_index_show_file(revs, "+", new_file, oid, !is_null_oid(oid), mode, dirty_submodule);
>  }
>
> +static int show_modified(struct rev_info *revs,
> +                        const struct cache_entry *old_entry,
> +                        const struct cache_entry *new_entry,
> +                        int report_missing,
> +                        int cached, int match_missing);
> +
> +static int compare_within_sparse_dir(int n, unsigned long mask,
> +                                    unsigned long dirmask, struct name_entry *entry,
> +                                    struct traverse_info *info)
> +{
> +       struct rev_info *revs = info->data;
> +       struct object_id *oid0 = &entry[0].oid;
> +       struct object_id *oid1 = &entry[1].oid;
> +
> +       if (oideq(oid0, oid1))
> +               return mask;
> +
> +       /* Directory/file conflicts are handled earlier. */
> +       if (S_ISDIR(entry[0].mode) && S_ISDIR(entry[1].mode)) {
> +               struct tree_desc t[2];
> +               void *buf[2];
> +               struct traverse_info info_r = { NULL, };
> +
> +               info_r.name = xstrfmt("%s%s", info->traverse_path, entry[0].path);
> +               info_r.namelen = strlen(info_r.name);
> +               info_r.traverse_path = xstrfmt("%s/", info_r.name);
> +               info_r.fn = compare_within_sparse_dir;
> +               info_r.prev = info;
> +               info_r.mode = entry[0].mode;
> +               info_r.pathlen = entry[0].pathlen;
> +               info_r.df_conflicts = 0;
> +               info_r.data = revs;
> +
> +               buf[0] = fill_tree_descriptor(revs->repo, &t[0], oid0);
> +               buf[1] = fill_tree_descriptor(revs->repo, &t[1], oid1);
> +
> +               traverse_trees(NULL, 2, t, &info_r);
> +
> +               free((char *)info_r.name);
> +               free((char *)info_r.traverse_path);
> +               free(buf[0]);
> +               free(buf[1]);
> +       } else {
> +               char *old_path = NULL, *new_path = NULL;
> +               struct cache_entry *old_entry = NULL, *new_entry = NULL;
> +
> +               if (entry[0].path) {
> +                       old_path = xstrfmt("%s%s", info->traverse_path, entry[0].path);
> +                       old_entry = make_transient_cache_entry(
> +                                       entry[0].mode, &entry[0].oid,
> +                                       old_path, /* stage */ 0);
> +                       old_entry->ce_flags |= CE_SKIP_WORKTREE;
> +               }
> +               if (entry[1].path) {
> +                       new_path = xstrfmt("%s%s", info->traverse_path, entry[1].path);
> +                       new_entry = make_transient_cache_entry(
> +                                       entry[1].mode, &entry[1].oid,
> +                                       new_path, /* stage */ 0);
> +                       new_entry->ce_flags |= CE_SKIP_WORKTREE;
> +               }
> +
> +               if (entry[0].path && entry[1].path)
> +                       show_modified(revs, old_entry, new_entry, 0, 1, 0);
> +               else if (entry[0].path)
> +                       diff_index_show_file(revs, revs->prefix,
> +                                            old_entry, &entry[0].oid,
> +                                            0, entry[0].mode, 0);
> +               else if (entry[1].path)
> +                       show_new_file(revs, new_entry, 1, 0);
> +
> +               discard_cache_entry(old_entry);
> +               discard_cache_entry(new_entry);
> +               free(old_path);
> +               free(new_path);
> +       }
> +
> +       return mask;
> +}
> +
> +static void show_modified_sparse_directory(struct rev_info *revs,
> +                        const struct cache_entry *old_entry,
> +                        const struct cache_entry *new_entry,
> +                        int report_missing,
> +                        int cached, int match_missing)
> +{
> +       struct tree_desc t[2];
> +       void *buf[2];
> +       struct traverse_info info = { NULL };
> +       struct strbuf name = STRBUF_INIT;
> +       struct strbuf parent_path = STRBUF_INIT;
> +       char *last_dir_sep;
> +
> +       if (oideq(&old_entry->oid, &new_entry->oid))
> +               return;
> +
> +       info.fn = compare_within_sparse_dir;
> +       info.prev = &info;
> +
> +       strbuf_add(&name, new_entry->name, new_entry->ce_namelen - 1);
> +       info.name = name.buf;
> +       info.namelen = name.len;
> +
> +       strbuf_add(&parent_path, new_entry->name, new_entry->ce_namelen - 1);
> +       if ((last_dir_sep = find_last_dir_sep(parent_path.buf)) > parent_path.buf)
> +               strbuf_setlen(&parent_path, (last_dir_sep - parent_path.buf) - 1);
> +       else
> +               strbuf_setlen(&parent_path, 0);
> +
> +       info.pathlen = parent_path.len;
> +
> +       if (parent_path.len)
> +               info.traverse_path = parent_path.buf;
> +       else
> +               info.traverse_path = "";
> +
> +       info.mode = new_entry->ce_mode;
> +       info.df_conflicts = 0;
> +       info.data = revs;
> +
> +       buf[0] = fill_tree_descriptor(revs->repo, &t[0], &old_entry->oid);
> +       buf[1] = fill_tree_descriptor(revs->repo, &t[1], &new_entry->oid);
> +
> +       traverse_trees(NULL, 2, t, &info);
> +
> +       free(buf[0]);
> +       free(buf[1]);
> +       strbuf_release(&name);
> +       strbuf_release(&parent_path);
> +}
> +
>  static int show_modified(struct rev_info *revs,
>                          const struct cache_entry *old_entry,
>                          const struct cache_entry *new_entry,
> @@ -343,6 +520,17 @@ static int show_modified(struct rev_info *revs,
>         const struct object_id *oid;
>         unsigned dirty_submodule = 0;
>
> +       /*
> +        * If both are sparse directory entries, then expand the
> +        * modifications to the file level.
> +        */
> +       if (old_entry && new_entry &&
> +           S_ISSPARSEDIR(old_entry->ce_mode) &&
> +           S_ISSPARSEDIR(new_entry->ce_mode)) {
> +               show_modified_sparse_directory(revs, old_entry, new_entry, report_missing, cached, match_missing);
> +               return 0;
> +       }
> +
>         if (get_stat_data(new_entry, &oid, &mode, cached, match_missing,
>                           &dirty_submodule, &revs->diffopt) < 0) {
>                 if (report_missing)
> --
> gitgitgadget
>
Junio C Hamano June 9, 2021, 6:32 a.m. UTC | #5
Elijah Newren <newren@gmail.com> writes:

> On Mon, Jun 7, 2021 at 5:34 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> While comparing an index to a tree, we may see a sparse directory entry.
>> In this case, we should compare that portion of the tree to the tree
>> represented by that entry. This could include a new tree which needs to
>> be expanded to a full list of added files. It could also include an
>> existing tree, in which case all of the changes inside are important to
>> describe, including the modifications, additions, and deletions. Note
>> that the case where the tree has a path and the index does not remains
>> identical to before: the lack of a cache entry is the same with a sparse
>> index.
>>
>> In the case where a tree is modified, we need to expand the tree
>> recursively, and start comparing each contained entry as either an
>> addition, deletion, or modification. This causes an interesting
>> recursion that did not exist before.
>
> So, I haven't read through this in detail yet...but there's a big
> question I'm curious about:
>
> Git already has code for comparing an index to a tree, a tree to a
> tree, or a tree to the working directory, right?  So, when comparing a
> sparse-index to a tree...can't we re-use the compare a tree to a tree
> code when we hit a sparse directory?

Offhand I do not think of a reason why that cannot work.

The tree-diff machinery takes two trees, walks them in parallel and
repeatedly calls either diff_addremove() or diff_change(), which
appends diff_filepair() to the diff_queue[] structure.  If you see
an unexpanded tree on the index side, you should be able to pass
that tree with the subtree you are comparing against to the tree-diff
machinery to come up with a series of filepairs, and then tweak the
pathnames of these filepairs (as such a two-tree comparison would be
comparing two trees representing a single subdirectory of two different
vintages) before adding them to the diff_queue[] you are collecting
the index-vs-tree diff, for example.

But if a part of the index is represented as a tree because it is
outside the cone of interest, should we even be showing the
difference in that part of the tree?  If t/ directory is outside the
cone of interest, should "git diff HEAD~100 HEAD t/" show anything
to begin with (the same question for "git diff --cached HEAD t/")?
Elijah Newren June 9, 2021, 8:11 a.m. UTC | #6
On Tue, Jun 8, 2021 at 11:32 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Mon, Jun 7, 2021 at 5:34 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> From: Derrick Stolee <dstolee@microsoft.com>
> >>
> >> While comparing an index to a tree, we may see a sparse directory entry.
> >> In this case, we should compare that portion of the tree to the tree
> >> represented by that entry. This could include a new tree which needs to
> >> be expanded to a full list of added files. It could also include an
> >> existing tree, in which case all of the changes inside are important to
> >> describe, including the modifications, additions, and deletions. Note
> >> that the case where the tree has a path and the index does not remains
> >> identical to before: the lack of a cache entry is the same with a sparse
> >> index.
> >>
> >> In the case where a tree is modified, we need to expand the tree
> >> recursively, and start comparing each contained entry as either an
> >> addition, deletion, or modification. This causes an interesting
> >> recursion that did not exist before.
> >
> > So, I haven't read through this in detail yet...but there's a big
> > question I'm curious about:
> >
> > Git already has code for comparing an index to a tree, a tree to a
> > tree, or a tree to the working directory, right?  So, when comparing a
> > sparse-index to a tree...can't we re-use the compare a tree to a tree
> > code when we hit a sparse directory?
>
> Offhand I do not think of a reason why that cannot work.
>
> The tree-diff machinery takes two trees, walks them in parallel and
> repeatedly calls either diff_addremove() or diff_change(), which
> appends diff_filepair() to the diff_queue[] structure.  If you see
> an unexpanded tree on the index side, you should be able to pass
> that tree with the subtree you are comparing against to the tree-diff
> machinery to come up with a series of filepairs, and then tweak the
> pathnames of these filepairs (as such a two-tree comparison would be
> comparing two trees representing a single subdirectory of two different
> vintages) before adding them to the diff_queue[] you are collecting
> the index-vs-tree diff, for example.

Good to know it seems my idea might be reasonable.

> But if a part of the index is represented as a tree because it is
> outside the cone of interest, should we even be showing the
> difference in that part of the tree?  If t/ directory is outside the
> cone of interest, should "git diff HEAD~100 HEAD t/" show anything
> to begin with (the same question for "git diff --cached HEAD t/")?

Excellent question...and not just for diff, but log, grep with
revisions, and other commands.  We discussed this a while back[1] and
we seemed to lean towards eventually adding a flag because there are
usecases both for (1) viewing full history while having sparsity paths
restrict just the working copy, and (2) also restricting the view of
history to the sparsity paths.

[1] It's been discussed a few times, but there's a relatively
comprehensive discussion at the "Commands that would change for
behavior A" section from
https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
Derrick Stolee June 9, 2021, 8:33 p.m. UTC | #7
On 6/9/2021 4:11 AM, Elijah Newren wrote:
> On Tue, Jun 8, 2021 at 11:32 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Elijah Newren <newren@gmail.com> writes:
>>
>>> On Mon, Jun 7, 2021 at 5:34 AM Derrick Stolee via GitGitGadget
>>> <gitgitgadget@gmail.com> wrote:
>>>>
>>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>>
>>>> While comparing an index to a tree, we may see a sparse directory entry.
>>>> In this case, we should compare that portion of the tree to the tree
>>>> represented by that entry. This could include a new tree which needs to
>>>> be expanded to a full list of added files. It could also include an
>>>> existing tree, in which case all of the changes inside are important to
>>>> describe, including the modifications, additions, and deletions. Note
>>>> that the case where the tree has a path and the index does not remains
>>>> identical to before: the lack of a cache entry is the same with a sparse
>>>> index.
>>>>
>>>> In the case where a tree is modified, we need to expand the tree
>>>> recursively, and start comparing each contained entry as either an
>>>> addition, deletion, or modification. This causes an interesting
>>>> recursion that did not exist before.
>>>
>>> So, I haven't read through this in detail yet...but there's a big
>>> question I'm curious about:
>>>
>>> Git already has code for comparing an index to a tree, a tree to a
>>> tree, or a tree to the working directory, right?  So, when comparing a
>>> sparse-index to a tree...can't we re-use the compare a tree to a tree
>>> code when we hit a sparse directory?
>>
>> Offhand I do not think of a reason why that cannot work.
>>
>> The tree-diff machinery takes two trees, walks them in parallel and
>> repeatedly calls either diff_addremove() or diff_change(), which
>> appends diff_filepair() to the diff_queue[] structure.  If you see
>> an unexpanded tree on the index side, you should be able to pass
>> that tree with the subtree you are comparing against to the tree-diff
>> machinery to come up with a series of filepairs, and then tweak the
>> pathnames of these filepairs (as such a two-tree comparison would be
>> comparing two trees representing a single subdirectory of two different
>> vintages) before adding them to the diff_queue[] you are collecting
>> the index-vs-tree diff, for example.
> 
> Good to know it seems my idea might be reasonable.

I agree that this is reasonable. I just didn't look hard enough
to find existing code for this, since I found traverse_trees and
thought that _was_ the library for this.

>> But if a part of the index is represented as a tree because it is
>> outside the cone of interest, should we even be showing the
>> difference in that part of the tree?  If t/ directory is outside the
>> cone of interest, should "git diff HEAD~100 HEAD t/" show anything
>> to begin with (the same question for "git diff --cached HEAD t/")?
> 
> Excellent question...and not just for diff, but log, grep with
> revisions, and other commands.  We discussed this a while back[1] and
> we seemed to lean towards eventually adding a flag because there are
> usecases both for (1) viewing full history while having sparsity paths
> restrict just the working copy, and (2) also restricting the view of
> history to the sparsity paths.
> 
> [1] It's been discussed a few times, but there's a relatively
> comprehensive discussion at the "Commands that would change for
> behavior A" section from
> https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/

Yes, we could investigate this behavior change in the future. The
good thing is that these points that handle sparse directory
entries create clear branching points for that future behavior
change.

Thanks,
-Stolee
Derrick Stolee June 10, 2021, 5:45 p.m. UTC | #8
On 6/9/2021 4:33 PM, Derrick Stolee wrote:
> On 6/9/2021 4:11 AM, Elijah Newren wrote:
>> On Tue, Jun 8, 2021 at 11:32 PM Junio C Hamano <gitster@pobox.com> wrote:
>>>
>>> Elijah Newren <newren@gmail.com> writes:
>>>
>>> The tree-diff machinery takes two trees, walks them in parallel and
>>> repeatedly calls either diff_addremove() or diff_change(), which
>>> appends diff_filepair() to the diff_queue[] structure.  If you see
>>> an unexpanded tree on the index side, you should be able to pass
>>> that tree with the subtree you are comparing against to the tree-diff
>>> machinery to come up with a series of filepairs, and then tweak the
>>> pathnames of these filepairs (as such a two-tree comparison would be
>>> comparing two trees representing a single subdirectory of two different
>>> vintages) before adding them to the diff_queue[] you are collecting
>>> the index-vs-tree diff, for example.
>>
>> Good to know it seems my idea might be reasonable.
> 
> I agree that this is reasonable. I just didn't look hard enough
> to find existing code for this, since I found traverse_trees and
> thought that _was_ the library for this.

This was surprisingly simple, since most of the complicated stuff
is built into diff_tree_oid() and its use of revs->diffopt. The
new patch works as shown below the cut-line.

I was incredibly suspicious of how quickly this came together,
but it passes all the tests I have for it (including Scalar
functional tests with the commit, checkout, and add integrations).

I'll send a new version with this patch tomorrow, as well as the
other recommended edits.

Thanks,
-Stolee

--- >8 ---


diff --git a/diff-lib.c b/diff-lib.c
index c2ac9250fe9..b631df89343 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -316,6 +316,13 @@ static int get_stat_data(const struct index_state *istate,
 	return 0;
 }
 
+static void show_directory(struct rev_info *revs,
+			   const struct cache_entry *new_dir,
+			   int added)
+{
+	diff_tree_oid(NULL, &new_dir->oid, new_dir->name, &revs->diffopt);
+}
+
 static void show_new_file(struct rev_info *revs,
 			  const struct cache_entry *new_file,
 			  int cached, int match_missing)
@@ -325,6 +332,11 @@ static void show_new_file(struct rev_info *revs,
 	unsigned dirty_submodule = 0;
 	struct index_state *istate = revs->diffopt.repo->index;
 
+	if (new_file && S_ISSPARSEDIR(new_file->ce_mode)) {
+		show_directory(revs, new_file, /*added */ 1);
+		return;
+	}
+
 	/*
 	 * New file in the index: it might actually be different in
 	 * the working tree.
@@ -336,6 +348,15 @@ static void show_new_file(struct rev_info *revs,
 	diff_index_show_file(revs, "+", new_file, oid, !is_null_oid(oid), mode, dirty_submodule);
 }
 
+static void show_modified_sparse_directory(struct rev_info *revs,
+			 const struct cache_entry *old_entry,
+			 const struct cache_entry *new_entry,
+			 int report_missing,
+			 int cached, int match_missing)
+{
+	diff_tree_oid(&old_entry->oid, &new_entry->oid, new_entry->name, &revs->diffopt);
+}
+
 static int show_modified(struct rev_info *revs,
 			 const struct cache_entry *old_entry,
 			 const struct cache_entry *new_entry,
@@ -347,6 +368,17 @@ static int show_modified(struct rev_info *revs,
 	unsigned dirty_submodule = 0;
 	struct index_state *istate = revs->diffopt.repo->index;
 
+	/*
+	 * If both are sparse directory entries, then expand the
+	 * modifications to the file level.
+	 */
+	if (old_entry && new_entry &&
+	    S_ISSPARSEDIR(old_entry->ce_mode) &&
+	    S_ISSPARSEDIR(new_entry->ce_mode)) {
+		show_modified_sparse_directory(revs, old_entry, new_entry, report_missing, cached, match_missing);
+		return 0;
+	}
+
 	if (get_stat_data(istate, new_entry, &oid, &mode, cached, match_missing,
 			  &dirty_submodule, &revs->diffopt) < 0) {
 		if (report_missing)
Elijah Newren June 10, 2021, 9:31 p.m. UTC | #9
On Thu, Jun 10, 2021 at 10:45 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/9/2021 4:33 PM, Derrick Stolee wrote:
> > On 6/9/2021 4:11 AM, Elijah Newren wrote:
> >> On Tue, Jun 8, 2021 at 11:32 PM Junio C Hamano <gitster@pobox.com> wrote:
> >>>
> >>> Elijah Newren <newren@gmail.com> writes:
> >>>
> >>> The tree-diff machinery takes two trees, walks them in parallel and
> >>> repeatedly calls either diff_addremove() or diff_change(), which
> >>> appends diff_filepair() to the diff_queue[] structure.  If you see
> >>> an unexpanded tree on the index side, you should be able to pass
> >>> that tree with the subtree you are comparing against to the tree-diff
> >>> machinery to come up with a series of filepairs, and then tweak the
> >>> pathnames of these filepairs (as such a two-tree comparison would be
> >>> comparing two trees representing a single subdirectory of two different
> >>> vintages) before adding them to the diff_queue[] you are collecting
> >>> the index-vs-tree diff, for example.
> >>
> >> Good to know it seems my idea might be reasonable.
> >
> > I agree that this is reasonable. I just didn't look hard enough
> > to find existing code for this, since I found traverse_trees and
> > thought that _was_ the library for this.
>
> This was surprisingly simple, since most of the complicated stuff
> is built into diff_tree_oid() and its use of revs->diffopt. The
> new patch works as shown below the cut-line.
>
> I was incredibly suspicious of how quickly this came together,
> but it passes all the tests I have for it (including Scalar
> functional tests with the commit, checkout, and add integrations).

Nice!

> I'll send a new version with this patch tomorrow, as well as the
> other recommended edits.
>
> Thanks,
> -Stolee
>
> --- >8 ---
>
>
> diff --git a/diff-lib.c b/diff-lib.c
> index c2ac9250fe9..b631df89343 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -316,6 +316,13 @@ static int get_stat_data(const struct index_state *istate,
>         return 0;
>  }
>
> +static void show_directory(struct rev_info *revs,
> +                          const struct cache_entry *new_dir,
> +                          int added)
> +{
> +       diff_tree_oid(NULL, &new_dir->oid, new_dir->name, &revs->diffopt);
> +}
> +
>  static void show_new_file(struct rev_info *revs,
>                           const struct cache_entry *new_file,
>                           int cached, int match_missing)
> @@ -325,6 +332,11 @@ static void show_new_file(struct rev_info *revs,
>         unsigned dirty_submodule = 0;
>         struct index_state *istate = revs->diffopt.repo->index;
>
> +       if (new_file && S_ISSPARSEDIR(new_file->ce_mode)) {
> +               show_directory(revs, new_file, /*added */ 1);
> +               return;
> +       }
> +
>         /*
>          * New file in the index: it might actually be different in
>          * the working tree.
> @@ -336,6 +348,15 @@ static void show_new_file(struct rev_info *revs,
>         diff_index_show_file(revs, "+", new_file, oid, !is_null_oid(oid), mode, dirty_submodule);
>  }
>
> +static void show_modified_sparse_directory(struct rev_info *revs,
> +                        const struct cache_entry *old_entry,
> +                        const struct cache_entry *new_entry,
> +                        int report_missing,
> +                        int cached, int match_missing)
> +{
> +       diff_tree_oid(&old_entry->oid, &new_entry->oid, new_entry->name, &revs->diffopt);
> +}
> +
>  static int show_modified(struct rev_info *revs,
>                          const struct cache_entry *old_entry,
>                          const struct cache_entry *new_entry,
> @@ -347,6 +368,17 @@ static int show_modified(struct rev_info *revs,
>         unsigned dirty_submodule = 0;
>         struct index_state *istate = revs->diffopt.repo->index;
>
> +       /*
> +        * If both are sparse directory entries, then expand the
> +        * modifications to the file level.
> +        */
> +       if (old_entry && new_entry &&
> +           S_ISSPARSEDIR(old_entry->ce_mode) &&
> +           S_ISSPARSEDIR(new_entry->ce_mode)) {
> +               show_modified_sparse_directory(revs, old_entry, new_entry, report_missing, cached, match_missing);
> +               return 0;
> +       }

What if S_ISSPARSEDIR(old_entry->ce_mode) != S_ISSPARSEDIR(new_entry->ce_mode) ?

> +
>         if (get_stat_data(istate, new_entry, &oid, &mode, cached, match_missing,
>                           &dirty_submodule, &revs->diffopt) < 0) {
>                 if (report_missing)
Derrick Stolee June 11, 2021, 12:57 p.m. UTC | #10
On 6/10/2021 5:31 PM, Elijah Newren wrote:
> On Thu, Jun 10, 2021 at 10:45 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 6/9/2021 4:33 PM, Derrick Stolee wrote:
>>> On 6/9/2021 4:11 AM, Elijah Newren wrote:
>>>> On Tue, Jun 8, 2021 at 11:32 PM Junio C Hamano <gitster@pobox.com> wrote:
>>>>>
>>>>> Elijah Newren <newren@gmail.com> writes:
>>>>>
>>>>> The tree-diff machinery takes two trees, walks them in parallel and
>>>>> repeatedly calls either diff_addremove() or diff_change(), which
>>>>> appends diff_filepair() to the diff_queue[] structure.  If you see
>>>>> an unexpanded tree on the index side, you should be able to pass
>>>>> that tree with the subtree you are comparing against to the tree-diff
>>>>> machinery to come up with a series of filepairs, and then tweak the
>>>>> pathnames of these filepairs (as such a two-tree comparison would be
>>>>> comparing two trees representing a single subdirectory of two different
>>>>> vintages) before adding them to the diff_queue[] you are collecting
>>>>> the index-vs-tree diff, for example.
>>>>
>>>> Good to know it seems my idea might be reasonable.
>>>
>>> I agree that this is reasonable. I just didn't look hard enough
>>> to find existing code for this, since I found traverse_trees and
>>> thought that _was_ the library for this.
>>
>> This was surprisingly simple, since most of the complicated stuff
>> is built into diff_tree_oid() and its use of revs->diffopt. The
>> new patch works as shown below the cut-line.
>>
>> I was incredibly suspicious of how quickly this came together,
>> but it passes all the tests I have for it (including Scalar
>> functional tests with the commit, checkout, and add integrations).
> 
> Nice!
> 
>> I'll send a new version with this patch tomorrow, as well as the
>> other recommended edits.

...still planning on this today, but...

>> +       /*
>> +        * If both are sparse directory entries, then expand the
>> +        * modifications to the file level.
>> +        */
>> +       if (old_entry && new_entry &&
>> +           S_ISSPARSEDIR(old_entry->ce_mode) &&
>> +           S_ISSPARSEDIR(new_entry->ce_mode)) {
>> +               show_modified_sparse_directory(revs, old_entry, new_entry, report_missing, cached, match_missing);
>> +               return 0;
>> +       }
> 
> What if S_ISSPARSEDIR(old_entry->ce_mode) != S_ISSPARSEDIR(new_entry->ce_mode) ?

You make a good point that something different would happen
in the case of a directory/file conflict on the sparse checkout
boundary. This can be as simple as the trivial "only files at
root" cone-mode sparse-checkout definition, with "folder/" (tree)
changing to "folder" (blob).

I'll see what I can do to create a test scenario for
this and add the correct cases.

Thanks,
-Stolee
Derrick Stolee June 11, 2021, 5:27 p.m. UTC | #11
On 6/11/2021 8:57 AM, Derrick Stolee wrote:
> On 6/10/2021 5:31 PM, Elijah Newren wrote:
>> On Thu, Jun 10, 2021 at 10:45 AM Derrick Stolee <stolee@gmail.com> wrote:
>>>
>>> I'll send a new version with this patch tomorrow, as well as the
>>> other recommended edits.
> 
> ...still planning on this today, but...

So optimistic!
 
>>> +       /*
>>> +        * If both are sparse directory entries, then expand the
>>> +        * modifications to the file level.
>>> +        */
>>> +       if (old_entry && new_entry &&
>>> +           S_ISSPARSEDIR(old_entry->ce_mode) &&
>>> +           S_ISSPARSEDIR(new_entry->ce_mode)) {
>>> +               show_modified_sparse_directory(revs, old_entry, new_entry, report_missing, cached, match_missing);
>>> +               return 0;
>>> +       }
>>
>> What if S_ISSPARSEDIR(old_entry->ce_mode) != S_ISSPARSEDIR(new_entry->ce_mode) ?
> 
> You make a good point that something different would happen
> in the case of a directory/file conflict on the sparse checkout
> boundary. This can be as simple as the trivial "only files at
> root" cone-mode sparse-checkout definition, with "folder/" (tree)
> changing to "folder" (blob).
> 
> I'll see what I can do to create a test scenario for
> this and add the correct cases.

Creating a directory/file conflict in this way exposes a bug in
a different codepath in unpack_trees(), although it isn't visible
until 'git checkout' allows the index to stay sparse. It's due to
the code in unpack_callback() that handles blobs and trees
differently, and hence the blob/tree conflict isn't handled
appropriately there. The changes from Patch 8 are to blame for
these first errors.

At least, those are the first errors I have discovered with these
conflicts. There might be other scenarios that care about this
section of diff-lib.c, but I have not gotten to a point where
such behavior would be exposed.

I don't expect to succeed in squashing this bug today, so I'll
try again next week.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/diff-lib.c b/diff-lib.c
index b73cc1859a49..ba4c683d4bc4 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -314,6 +314,48 @@  static int get_stat_data(const struct cache_entry *ce,
 	return 0;
 }
 
+struct show_new_tree_context {
+	struct rev_info *revs;
+	unsigned added:1;
+};
+
+static int show_new_file_from_tree(const struct object_id *oid,
+				   struct strbuf *base, const char *path,
+				   unsigned int mode, void *context)
+{
+	struct show_new_tree_context *ctx = context;
+	struct cache_entry *new_file = make_transient_cache_entry(mode, oid, path, /* stage */ 0);
+
+	diff_index_show_file(ctx->revs, ctx->added ? "+" : "-", new_file, oid, !is_null_oid(oid), mode, 0);
+	discard_cache_entry(new_file);
+	return 0;
+}
+
+static void show_directory(struct rev_info *revs,
+			   const struct cache_entry *new_dir,
+			   int added)
+{
+	/*
+	 * new_dir is a sparse directory entry, so we want to collect all
+	 * of the new files within the tree. This requires recursively
+	 * expanding the trees.
+	 */
+	struct show_new_tree_context ctx = { revs, added };
+	struct repository *r = revs->repo;
+	struct strbuf base = STRBUF_INIT;
+	struct pathspec ps;
+	struct tree *tree = lookup_tree(r, &new_dir->oid);
+
+	memset(&ps, 0, sizeof(ps));
+	ps.recursive = 1;
+	ps.has_wildcard = 1;
+	ps.max_depth = -1;
+
+	strbuf_add(&base, new_dir->name, new_dir->ce_namelen);
+	read_tree_at(r, tree, &base, &ps,
+			show_new_file_from_tree, &ctx);
+}
+
 static void show_new_file(struct rev_info *revs,
 			  const struct cache_entry *new_file,
 			  int cached, int match_missing)
@@ -322,6 +364,11 @@  static void show_new_file(struct rev_info *revs,
 	unsigned int mode;
 	unsigned dirty_submodule = 0;
 
+	if (new_file && S_ISSPARSEDIR(new_file->ce_mode)) {
+		show_directory(revs, new_file, /*added */ 1);
+		return;
+	}
+
 	/*
 	 * New file in the index: it might actually be different in
 	 * the working tree.
@@ -333,6 +380,136 @@  static void show_new_file(struct rev_info *revs,
 	diff_index_show_file(revs, "+", new_file, oid, !is_null_oid(oid), mode, dirty_submodule);
 }
 
+static int show_modified(struct rev_info *revs,
+			 const struct cache_entry *old_entry,
+			 const struct cache_entry *new_entry,
+			 int report_missing,
+			 int cached, int match_missing);
+
+static int compare_within_sparse_dir(int n, unsigned long mask,
+				     unsigned long dirmask, struct name_entry *entry,
+				     struct traverse_info *info)
+{
+	struct rev_info *revs = info->data;
+	struct object_id *oid0 = &entry[0].oid;
+	struct object_id *oid1 = &entry[1].oid;
+
+	if (oideq(oid0, oid1))
+		return mask;
+
+	/* Directory/file conflicts are handled earlier. */
+	if (S_ISDIR(entry[0].mode) && S_ISDIR(entry[1].mode)) {
+		struct tree_desc t[2];
+		void *buf[2];
+		struct traverse_info info_r = { NULL, };
+
+		info_r.name = xstrfmt("%s%s", info->traverse_path, entry[0].path);
+		info_r.namelen = strlen(info_r.name);
+		info_r.traverse_path = xstrfmt("%s/", info_r.name);
+		info_r.fn = compare_within_sparse_dir;
+		info_r.prev = info;
+		info_r.mode = entry[0].mode;
+		info_r.pathlen = entry[0].pathlen;
+		info_r.df_conflicts = 0;
+		info_r.data = revs;
+
+		buf[0] = fill_tree_descriptor(revs->repo, &t[0], oid0);
+		buf[1] = fill_tree_descriptor(revs->repo, &t[1], oid1);
+
+		traverse_trees(NULL, 2, t, &info_r);
+
+		free((char *)info_r.name);
+		free((char *)info_r.traverse_path);
+		free(buf[0]);
+		free(buf[1]);
+	} else {
+		char *old_path = NULL, *new_path = NULL;
+		struct cache_entry *old_entry = NULL, *new_entry = NULL;
+
+		if (entry[0].path) {
+			old_path = xstrfmt("%s%s", info->traverse_path, entry[0].path);
+			old_entry = make_transient_cache_entry(
+					entry[0].mode, &entry[0].oid,
+					old_path, /* stage */ 0);
+			old_entry->ce_flags |= CE_SKIP_WORKTREE;
+		}
+		if (entry[1].path) {
+			new_path = xstrfmt("%s%s", info->traverse_path, entry[1].path);
+			new_entry = make_transient_cache_entry(
+					entry[1].mode, &entry[1].oid,
+					new_path, /* stage */ 0);
+			new_entry->ce_flags |= CE_SKIP_WORKTREE;
+		}
+
+		if (entry[0].path && entry[1].path)
+			show_modified(revs, old_entry, new_entry, 0, 1, 0);
+		else if (entry[0].path)
+			diff_index_show_file(revs, revs->prefix,
+					     old_entry, &entry[0].oid,
+					     0, entry[0].mode, 0);
+		else if (entry[1].path)
+			show_new_file(revs, new_entry, 1, 0);
+
+		discard_cache_entry(old_entry);
+		discard_cache_entry(new_entry);
+		free(old_path);
+		free(new_path);
+	}
+
+	return mask;
+}
+
+static void show_modified_sparse_directory(struct rev_info *revs,
+			 const struct cache_entry *old_entry,
+			 const struct cache_entry *new_entry,
+			 int report_missing,
+			 int cached, int match_missing)
+{
+	struct tree_desc t[2];
+	void *buf[2];
+	struct traverse_info info = { NULL };
+	struct strbuf name = STRBUF_INIT;
+	struct strbuf parent_path = STRBUF_INIT;
+	char *last_dir_sep;
+
+	if (oideq(&old_entry->oid, &new_entry->oid))
+		return;
+
+	info.fn = compare_within_sparse_dir;
+	info.prev = &info;
+
+	strbuf_add(&name, new_entry->name, new_entry->ce_namelen - 1);
+	info.name = name.buf;
+	info.namelen = name.len;
+
+	strbuf_add(&parent_path, new_entry->name, new_entry->ce_namelen - 1);
+	if ((last_dir_sep = find_last_dir_sep(parent_path.buf)) > parent_path.buf)
+		strbuf_setlen(&parent_path, (last_dir_sep - parent_path.buf) - 1);
+	else
+		strbuf_setlen(&parent_path, 0);
+
+	info.pathlen = parent_path.len;
+
+	if (parent_path.len)
+		info.traverse_path = parent_path.buf;
+	else
+		info.traverse_path = "";
+
+	info.mode = new_entry->ce_mode;
+	info.df_conflicts = 0;
+	info.data = revs;
+
+	buf[0] = fill_tree_descriptor(revs->repo, &t[0], &old_entry->oid);
+	buf[1] = fill_tree_descriptor(revs->repo, &t[1], &new_entry->oid);
+
+	traverse_trees(NULL, 2, t, &info);
+
+	free(buf[0]);
+	free(buf[1]);
+	strbuf_release(&name);
+	strbuf_release(&parent_path);
+}
+
 static int show_modified(struct rev_info *revs,
 			 const struct cache_entry *old_entry,
 			 const struct cache_entry *new_entry,
@@ -343,6 +520,17 @@  static int show_modified(struct rev_info *revs,
 	const struct object_id *oid;
 	unsigned dirty_submodule = 0;
 
+	/*
+	 * If both are sparse directory entries, then expand the
+	 * modifications to the file level.
+	 */
+	if (old_entry && new_entry &&
+	    S_ISSPARSEDIR(old_entry->ce_mode) &&
+	    S_ISSPARSEDIR(new_entry->ce_mode)) {
+		show_modified_sparse_directory(revs, old_entry, new_entry, report_missing, cached, match_missing);
+		return 0;
+	}
+
 	if (get_stat_data(new_entry, &oid, &mode, cached, match_missing,
 			  &dirty_submodule, &revs->diffopt) < 0) {
 		if (report_missing)