diff mbox series

[2/6] diff: ignore sparse paths in diffstat

Message ID 9f50f11d394e46ffbe348a579792c2b683096452.1629220124.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Sparse Index: Integrate with merge, cherry-pick, rebase, and revert | expand

Commit Message

Derrick Stolee Aug. 17, 2021, 5:08 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

The diff_populate_filespec() method is used to describe the diff after a
merge operation is complete, especially when a conflict appears. In
order to avoid expanding a sparse index, the reuse_worktree_file() needs
to be adapted to ignore files that are outside of the sparse-checkout
cone. The file names and OIDs used for this check come from the merged
tree in the case of the ORT strategy, not the index, hence the ability
to look into these paths without having already expanded the index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 diff.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Elijah Newren Aug. 20, 2021, 9:32 p.m. UTC | #1
On Tue, Aug 17, 2021 at 10:08 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The diff_populate_filespec() method is used to describe the diff after a
> merge operation is complete, especially when a conflict appears. In
> order to avoid expanding a sparse index, the reuse_worktree_file() needs
> to be adapted to ignore files that are outside of the sparse-checkout
> cone. The file names and OIDs used for this check come from the merged
> tree in the case of the ORT strategy, not the index, hence the ability
> to look into these paths without having already expanded the index.

I'm confused; I thought the diffstat was only shown if the merge was
successful, in which case there would be no conflicts appearing.

Also, I'm not that familiar with the general diff machinery (just the
rename detection parts), but...if the diffstat only shows when the
merge is successful, wouldn't we be comparing two REVS (ORIG_HEAD to
HEAD)?  Why would we make use of the working tree at all in such a
case?  And, wouldn't using the working tree be dangerous...what if
there was a merge performed with a dirty working tree?

On a bit of a tangent, I know diffcore-rename.c calls into
diff_populate_filespec() as well, and I have some code doing merges in
a bare repository (where there obviously is no index).  It seemed to
be working, but given this commit message, now I'm wondering if I've
missed something fundamental either in that implementation or there's
something amiss in this patch.  Or both.  Maybe I need to dig into
diff_populate_filespec() more, but it seems you already have.  Any
pointers to orient me on why your changes are right here (and, if you
know, why diffcore-rename.c should or shouldn't be using
diff_populate_filespec() in a bare repo)?


> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  diff.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/diff.c b/diff.c
> index a8113f17070..c457cfa0e59 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -26,6 +26,7 @@
>  #include "parse-options.h"
>  #include "help.h"
>  #include "promisor-remote.h"
> +#include "dir.h"
>
>  #ifdef NO_FAST_WORKING_DIRECTORY
>  #define FAST_WORKING_DIRECTORY 0
> @@ -3900,6 +3901,13 @@ static int reuse_worktree_file(struct index_state *istate,
>         if (!FAST_WORKING_DIRECTORY && !want_file && has_object_pack(oid))
>                 return 0;
>
> +       /*
> +        * If this path does not match our sparse-checkout definition,
> +        * then the file will not be in the working directory.
> +        */
> +       if (!path_in_sparse_checkout(name, istate))
> +               return 0;
> +
>         /*
>          * Similarly, if we'd have to convert the file contents anyway, that
>          * makes the optimization not worthwhile.
> --
> gitgitgadget
>
Derrick Stolee Aug. 24, 2021, 6:30 p.m. UTC | #2
On 8/20/2021 5:32 PM, Elijah Newren wrote:
> On Tue, Aug 17, 2021 at 10:08 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The diff_populate_filespec() method is used to describe the diff after a
>> merge operation is complete, especially when a conflict appears. In
>> order to avoid expanding a sparse index, the reuse_worktree_file() needs
>> to be adapted to ignore files that are outside of the sparse-checkout
>> cone. The file names and OIDs used for this check come from the merged
>> tree in the case of the ORT strategy, not the index, hence the ability
>> to look into these paths without having already expanded the index.
> 
> I'm confused; I thought the diffstat was only shown if the merge was
> successful, in which case there would be no conflicts appearing.

That's my mistake. I'll edit the message accordingly.
 
> Also, I'm not that familiar with the general diff machinery (just the
> rename detection parts), but...if the diffstat only shows when the
> merge is successful, wouldn't we be comparing two REVS (ORIG_HEAD to
> HEAD)?  Why would we make use of the working tree at all in such a
> case?  And, wouldn't using the working tree be dangerous...what if
> there was a merge performed with a dirty working tree?
> 
> On a bit of a tangent, I know diffcore-rename.c calls into
> diff_populate_filespec() as well, and I have some code doing merges in
> a bare repository (where there obviously is no index).  It seemed to
> be working, but given this commit message, now I'm wondering if I've
> missed something fundamental either in that implementation or there's
> something amiss in this patch.  Or both.  Maybe I need to dig into
> diff_populate_filespec() more, but it seems you already have.  Any
> pointers to orient me on why your changes are right here (and, if you
> know, why diffcore-rename.c should or shouldn't be using
> diff_populate_filespec() in a bare repo)?

I think the cases you are thinking about are covered by this
condition before the one I'm inserting:

	/* We want to avoid the working directory if our caller
	 * doesn't need the data in a normal file, this system
	 * is rather slow with its stat/open/mmap/close syscalls,
	 * and the object is contained in a pack file.  The pack
	 * is probably already open and will be faster to obtain
	 * the data through than the working directory.  Loose
	 * objects however would tend to be slower as they need
	 * to be individually opened and inflated.
	 */
	if (!FAST_WORKING_DIRECTORY && !want_file && has_object_pack(oid))
		return 0;

or after:

	/*
	 * Similarly, if we'd have to convert the file contents anyway, that
	 * makes the optimization not worthwhile.
	 */
	if (!want_file && would_convert_to_git(istate, name))
		return 0;

(This makes me think that I should move my new condition further down
so these two can be linked by context.)

Sounds like this is just an optimization, so it is fine to opt out of it
if we think the optimization isn't necessary. Outside of the sparse-checkout
cone qualifies.

Thanks,
-Stolee
Elijah Newren Aug. 27, 2021, 10:27 p.m. UTC | #3
On Tue, Aug 24, 2021 at 11:30 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 8/20/2021 5:32 PM, Elijah Newren wrote:
> > On Tue, Aug 17, 2021 at 10:08 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> From: Derrick Stolee <dstolee@microsoft.com>
> >>
> >> The diff_populate_filespec() method is used to describe the diff after a
> >> merge operation is complete, especially when a conflict appears. In
> >> order to avoid expanding a sparse index, the reuse_worktree_file() needs
> >> to be adapted to ignore files that are outside of the sparse-checkout
> >> cone. The file names and OIDs used for this check come from the merged
> >> tree in the case of the ORT strategy, not the index, hence the ability
> >> to look into these paths without having already expanded the index.
> >
> > I'm confused; I thought the diffstat was only shown if the merge was
> > successful, in which case there would be no conflicts appearing.
>
> That's my mistake. I'll edit the message accordingly.
>
> > Also, I'm not that familiar with the general diff machinery (just the
> > rename detection parts), but...if the diffstat only shows when the
> > merge is successful, wouldn't we be comparing two REVS (ORIG_HEAD to
> > HEAD)?  Why would we make use of the working tree at all in such a
> > case?  And, wouldn't using the working tree be dangerous...what if
> > there was a merge performed with a dirty working tree?
> >
> > On a bit of a tangent, I know diffcore-rename.c calls into
> > diff_populate_filespec() as well, and I have some code doing merges in
> > a bare repository (where there obviously is no index).  It seemed to
> > be working, but given this commit message, now I'm wondering if I've
> > missed something fundamental either in that implementation or there's
> > something amiss in this patch.  Or both.  Maybe I need to dig into
> > diff_populate_filespec() more, but it seems you already have.  Any
> > pointers to orient me on why your changes are right here (and, if you
> > know, why diffcore-rename.c should or shouldn't be using
> > diff_populate_filespec() in a bare repo)?
>
> I think the cases you are thinking about are covered by this
> condition before the one I'm inserting:
>
>         /* We want to avoid the working directory if our caller
>          * doesn't need the data in a normal file, this system
>          * is rather slow with its stat/open/mmap/close syscalls,
>          * and the object is contained in a pack file.  The pack
>          * is probably already open and will be faster to obtain
>          * the data through than the working directory.  Loose
>          * objects however would tend to be slower as they need
>          * to be individually opened and inflated.
>          */
>         if (!FAST_WORKING_DIRECTORY && !want_file && has_object_pack(oid))
>                 return 0;
>
> or after:
>
>         /*
>          * Similarly, if we'd have to convert the file contents anyway, that
>          * makes the optimization not worthwhile.
>          */
>         if (!want_file && would_convert_to_git(istate, name))
>                 return 0;
>
> (This makes me think that I should move my new condition further down
> so these two can be linked by context.)
>
> Sounds like this is just an optimization, so it is fine to opt out of it
> if we think the optimization isn't necessary. Outside of the sparse-checkout
> cone qualifies.

Ah, this is very helpful.  Thanks for digging up these details.
diff mbox series

Patch

diff --git a/diff.c b/diff.c
index a8113f17070..c457cfa0e59 100644
--- a/diff.c
+++ b/diff.c
@@ -26,6 +26,7 @@ 
 #include "parse-options.h"
 #include "help.h"
 #include "promisor-remote.h"
+#include "dir.h"
 
 #ifdef NO_FAST_WORKING_DIRECTORY
 #define FAST_WORKING_DIRECTORY 0
@@ -3900,6 +3901,13 @@  static int reuse_worktree_file(struct index_state *istate,
 	if (!FAST_WORKING_DIRECTORY && !want_file && has_object_pack(oid))
 		return 0;
 
+	/*
+	 * If this path does not match our sparse-checkout definition,
+	 * then the file will not be in the working directory.
+	 */
+	if (!path_in_sparse_checkout(name, istate))
+		return 0;
+
 	/*
 	 * Similarly, if we'd have to convert the file contents anyway, that
 	 * makes the optimization not worthwhile.