[00/10,RFC] In-tree sparse-checkout definitions

Message ID	pull.627.git.1588857462.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <SRS0=p769=6V=vger.kernel.org=git-owner@kernel.org> Message-Id: <pull.627.git.1588857462.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Date: Thu, 07 May 2020 13:17:32 +0000 Subject: [PATCH 00/10] [RFC] In-tree sparse-checkout definitions Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: newren@gmaill.com, peff@peff.net, me@ttaylorr.com, jrnieder@gmail.com, Derrick Stolee <dstolee@microsoft.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	In-tree sparse-checkout definitions \| expand [00/10,RFC] In-tree sparse-checkout definitions [01/10] unpack-trees: avoid array out-of-bounds error [02/10] sparse-checkout: move code from builtin [03/10] sparse-checkout: move code from unpack-trees.c [04/10] sparse-checkout: allow in-tree definitions [05/10] sparse-checkout: automatically update in-tree definition [06/10] sparse-checkout: use oidset to prevent repeat blobs [07/10] sparse-checkout: define in-tree dependencies [08/10] Makefile: skip git-gui if dir is missing [09/10] Makefile: disable GETTEXT when 'po' is missing [10/10] .sparse: add in-tree sparse-checkout for Git

Linus Arver via GitGitGadget May 7, 2020, 1:17 p.m. UTC

This is based on ds/sparse-allow-empty-working-tree.

BACKGROUND
==========

After discussing the future of sparse-checkout at the Git Contributor
Summit, Minh Tai caught me during the break to introduce a very interesting
idea: in-tree sparse-checkout definitions.

Here is the context: I was talking about how the Office team is building a
tool to update the sparse-checkout definition dynamically as their build
definitions change. They have a very compartmentalized build system, and
certain "projects" depend on each other. To ensure that builds do not break
as dependencies change, the plan was to have this tool run as a hook when
HEAD changes. This is a bit expensive to do, and we were discussing how much
we need this to be automated versus a part of the build process.

This kind of idea would need to be replicated for every team that wants to
use sparse-checkout in this way. That doesn't bode well for wide adoption of
the feature.

IN-TREE SPARSE-CHECKOUT DEFINITIONS
===================================

Minh's idea was simple: have sparse-checkout files in the working directory
and use config to point to them. As these in-tree files update, we can
automatically update the sparse-checkout definition accordingly. Now, the
only thing to do would be to ensure that the sparse-checkout files are
updated when someone updates the build definitions. This requires some extra
build validation, but would not require special tools built on every client.

As soon as I sat down to implement this idea, I discovered some
complications to that basic idea. Each of these sparse-checkout definitions
should fit a "role" on the team. To continue with the Office analogy,
suppose we had a definition for developers working on each of Word,
PowerPoint, and Excel. What happens when someone wants to port a feature
from Excel to Word? That user would want to take the union of the two
sparse-checkout definitions. But what does that mean when we are working
with arbitrary sparse-checkout files?

This leads to the restriction that I built into this feature: we only care
about "cone mode" definitions. These rely on directory-based prefix matches
that are very fast. With this restriction, it is simple to understand what
the "union" operation should do: take all directories that would be included
by either.

Once we restrict to cone mode, there is no reason to continue storing files
using the sparse-checkout file format. The list of patterns is larger than
the input to "git sparse-checkout set" or output from "git sparse-checkout
list" in cone mode, and more prone to user error. Thus, let's be simpler and
use a list of directories to specify the sparse-checkout definition.

This leads to the second issue that complicates things. If a build
dependency changes to a core library, but there are N "roles" that depend on
that library, then the person changing that library also needs to change and
validate N sparse-checkout definitions! This does not scale well. So, let's
take a page from build dependencies and allow these in-tree sparse-checkout
definitions to depend on each other!

The end result is a mechanism that should be flexible enough to future needs
but simple enough to build using existing Git features. The format re-uses
the config file format. This provides good re-use of code as well as being
something easy to extend in the future. In Patch 4, I tried to describe
several alternatives and why this is the best of those options. I'm open to
suggestions here.

OUTLINE
=======

* Patch 1 is a bugfix that should work on its own. I caught it in tests
after Patch 5.

* Patches 2-7 implement the feature.

* Patches 8-9 make the Git build system a bit more robust to missing
directories.

* Patch 10 adds a .sparse directory to Git as well as two in-tree
sparse-checkout definitions. One is for a bare-bones Linux developer and
the other adds the compat code for Windows on top.

As mentioned in patch 10, this allows the following workflow for Git
contributors that want to "eat our own dogfood" with the partial clone and
sparse-checkout features:

$ git clone --sparse --filter=blob:none https://github.com/git/git$ cd git
$ git sparse-checkout set --in-tree .sparse/base.deps

Then, we have a smaller working directory but can still build and test the
code.

FUTURE DIRECTIONS
=================

There are definitely ways to make this feature more seamless, or more useful
to us.

* It would be nice to extend git clone to have a string value to
--sparse=<file> and possibly have it multi-valued. This would initialize
the in-tree sparse-checkout definition immediately upon cloning.

* I think there are some ways we could reorganize the Git codebase to make
our "most basic" sparse-checkout file have even fewer files. Mostly, it
would require modifying the build to ask "do these files exist on-disk?
if not, then disable this build option." This would also require adding
test dependencies that also disable some tests when those do not exist.

* Currently, if we have an in-tree sparse-checkout definition, then we
can't add more directories to it manually. If someone runs git
sparse-checkout set --in-tree <file> followed by git sparse-checkout add
<dir>, then will disappear as it is overridden by the in-tree definition.
The only way I can think to get around this would be to add these
directories through Git config, since we were using the sparse-checkout
file as the storage of "what directories do I care about?" but it is no
longer the source of truth in this situation. I'm open to ideas here, and
it would be nice to have this interaction hardened before moving the
series out of RFC status.

Thanks, -Stolee

Derrick Stolee (10):
unpack-trees: avoid array out-of-bounds error
sparse-checkout: move code from builtin
sparse-checkout: move code from unpack-trees.c
sparse-checkout: allow in-tree definitions
sparse-checkout: automatically update in-tree definition
sparse-checkout: use oidset to prevent repeat blobs
sparse-checkout: define in-tree dependencies
Makefile: skip git-gui if dir is missing
Makefile: disable GETTEXT when 'po' is missing
.sparse: add in-tree sparse-checkout for Git

.sparse/base.deps | 19 ++
.sparse/windows.deps | 3 +
Documentation/config.txt | 2 +
Documentation/config/sparse.txt | 15 +
Documentation/git-sparse-checkout.txt | 70 ++++-
Makefile | 19 +-
builtin/commit.c | 4 +-
builtin/read-tree.c | 4 +
builtin/sparse-checkout.c | 268 +++--------------
read-cache.c | 8 +-
sparse-checkout.c | 418 ++++++++++++++++++++++++++
sparse-checkout.h | 30 ++
t/lib-gettext.sh | 1 -
t/t0200-gettext-basic.sh | 6 +-
t/t1091-sparse-checkout-builtin.sh | 143 +++++++++
t/test-lib.sh | 7 +-
unpack-trees.c | 10 +-
17 files changed, 785 insertions(+), 242 deletions(-)
create mode 100644 .sparse/base.deps
create mode 100644 .sparse/windows.deps
create mode 100644 Documentation/config/sparse.txt
create mode 100644 sparse-checkout.c
create mode 100644 sparse-checkout.h

base-commit: ace224ac5fb120e9cae894e31713ab60e91f141f
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-627%2Fderrickstolee%2Fsparse-checkout-in-tree-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-627/derrickstolee/sparse-checkout-in-tree-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/627

Elijah Newren May 20, 2020, 5:38 p.m. UTC | #1

On Thu, May 7, 2020 at 6:20 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This is based on ds/sparse-allow-empty-working-tree.
>
> BACKGROUND
> ==========
>
> After discussing the future of sparse-checkout at the Git Contributor
> Summit, Minh Tai caught me during the break to introduce a very interesting
> idea: in-tree sparse-checkout definitions.
>
> Here is the context: I was talking about how the Office team is building a
> tool to update the sparse-checkout definition dynamically as their build
> definitions change. They have a very compartmentalized build system, and
> certain "projects" depend on each other. To ensure that builds do not break
> as dependencies change, the plan was to have this tool run as a hook when
> HEAD changes. This is a bit expensive to do, and we were discussing how much
> we need this to be automated versus a part of the build process.
>
> This kind of idea would need to be replicated for every team that wants to
> use sparse-checkout in this way. That doesn't bode well for wide adoption of
> the feature.
>
> IN-TREE SPARSE-CHECKOUT DEFINITIONS
> ===================================
>
> Minh's idea was simple: have sparse-checkout files in the working directory
> and use config to point to them. As these in-tree files update, we can
> automatically update the sparse-checkout definition accordingly. Now, the
> only thing to do would be to ensure that the sparse-checkout files are
> updated when someone updates the build definitions. This requires some extra
> build validation, but would not require special tools built on every client.

I'm not sure how well having a parallel dependency structure will be
received, but the idea of auto-updating does seem really nice.  (I put
a hook in the build system that compares the timestamps on the files
that define dependency structures to the .git/info/sparse-checkout
file, and re-run `git sparse-checkout reapply`, which helps with
updating but is still delayed from the checkout/merge/rebase that
brought in new dependencies.)

> As soon as I sat down to implement this idea, I discovered some
> complications to that basic idea. Each of these sparse-checkout definitions
> should fit a "role" on the team. To continue with the Office analogy,
> suppose we had a definition for developers working on each of Word,
> PowerPoint, and Excel. What happens when someone wants to port a feature
> from Excel to Word? That user would want to take the union of the two
> sparse-checkout definitions. But what does that mean when we are working
> with arbitrary sparse-checkout files?
>
> This leads to the restriction that I built into this feature: we only care
> about "cone mode" definitions. These rely on directory-based prefix matches
> that are very fast. With this restriction, it is simple to understand what
> the "union" operation should do: take all directories that would be included
> by either.

Seems like you're adopting my idea of pushing folks towards "cone
mode" and limiting support for non-code modes[1], eh?  ;-)

I'm totally on board.  :-)

[1] https://lore.kernel.org/git/CABPp-BHKYR9QBcAG_pM6DniaeGS40X=ErKGGsvWa_KhogCZzEA@mail.gmail.com/

> Once we restrict to cone mode, there is no reason to continue storing files
> using the sparse-checkout file format. The list of patterns is larger than
> the input to "git sparse-checkout set" or output from "git sparse-checkout
> list" in cone mode, and more prone to user error. Thus, let's be simpler and
> use a list of directories to specify the sparse-checkout definition.

Ahah, you did come around to using a new config file[2].  ;-)

I like having simpler config files; sounds great.

[2] https://lore.kernel.org/git/c919513a-f41f-2ce8-60ed-e0b0733c0c7f@gmail.com/
 ("another config file")

> This leads to the second issue that complicates things. If a build
> dependency changes to a core library, but there are N "roles" that depend on
> that library, then the person changing that library also needs to change and
> validate N sparse-checkout definitions! This does not scale well. So, let's
> take a page from build dependencies and allow these in-tree sparse-checkout
> definitions to depend on each other!
>
> The end result is a mechanism that should be flexible enough to future needs
> but simple enough to build using existing Git features. The format re-uses
> the config file format. This provides good re-use of code as well as being
> something easy to extend in the future. In Patch 4, I tried to describe
> several alternatives and why this is the best of those options. I'm open to
> suggestions here.
>
> OUTLINE
> =======
>
>  * Patch 1 is a bugfix that should work on its own. I caught it in tests
>    after Patch 5.
>
>
>  * Patches 2-7 implement the feature.
>
>
>  * Patches 8-9 make the Git build system a bit more robust to missing
>    directories.
>
>
>  * Patch 10 adds a .sparse directory to Git as well as two in-tree
>    sparse-checkout definitions. One is for a bare-bones Linux developer and
>    the other adds the compat code for Windows on top.
>
>
>
> As mentioned in patch 10, this allows the following workflow for Git
> contributors that want to "eat our own dogfood" with the partial clone and
> sparse-checkout features:
>
>  $ git clone --sparse --filter=blob:none https://github.com/git/git
> $ cd git
> $ git sparse-checkout set --in-tree .sparse/base.deps
>
> Then, we have a smaller working directory but can still build and test the
> code.
>
> FUTURE DIRECTIONS
> =================
>
> There are definitely ways to make this feature more seamless, or more useful
> to us.
>
>  * It would be nice to extend git clone to have a string value to
>    --sparse=<file> and possibly have it multi-valued. This would initialize
>    the in-tree sparse-checkout definition immediately upon cloning.

Also in terms of future directions, I would like to have the option
for these to be passed to --filter instead of using --filter:none.

And on top of that, it may also make sense for users to be able to
fetch cones of history somehow.  For example, if they started with
`git clone --sparse=featureA ...` and later wanted to add the
'featureB' cone, it'd be nice to be able to run some kind of `git
fetch --sparse=featureB ...` command when they know they are on a good
network to download the history of objects that touched the featureB
cone instead of relying on the normal lazy fetching of partial clones.

This kind of thing could be especially useful if, say, we were in a
pandemic and everyone was forced to work from home.  But it'd also be
useful if you just had people travelling a lot and stuck in third
world places with low connectivity, like CalTrain.  ;-)

>  * I think there are some ways we could reorganize the Git codebase to make
>    our "most basic" sparse-checkout file have even fewer files. Mostly, it
>    would require modifying the build to ask "do these files exist on-disk?
>    if not, then disable this build option." This would also require adding
>    test dependencies that also disable some tests when those do not exist.
>
>
>  * Currently, if we have an in-tree sparse-checkout definition, then we
>    can't add more directories to it manually. If someone runs git
>    sparse-checkout set --in-tree <file> followed by git sparse-checkout add
>    <dir>, then will disappear as it is overridden by the in-tree definition.
>    The only way I can think to get around this would be to add these
>    directories through Git config, since we were using the sparse-checkout
>    file as the storage of "what directories do I care about?" but it is no
>    longer the source of truth in this situation. I'm open to ideas here, and
>    it would be nice to have this interaction hardened before moving the
>    series out of RFC status.

This seems like a big problem to me.  Actually, it sounds like two big problems.

1) non-extensibility

I suspect most will be happy enough with the in-tree definitions, but
there are always a few users that want to add to the pre-defined
sparsity sets to get some additional directories.  We could use
.gitignore as an example; although people have shared ignore rules in
.gitignore, we still allow user-defined additional ignore rules that
are not meant to be shared.  Perhaps we can allow users to use some
pre-defined additional file (similar to .git/info/exclude) or even
user-defined additional file (similar to core.excludesFile)?

We would still need to be careful to not allow the in-tree definitions
to depend on files that are not in-tree, of course.

(Even with my `sparsify` script which allowed things like `./sparsify
--modules moduleA moduleB` and which walked build system dependencies
to get the right directories, I still allowed additional directories
to be specified within a .git/info/sparse-checkout-user file.  Not
sure I liked the location of the file (it's sometimes hard to find for
users of worktrees), or the basename of the file I randomly chose that
day, but it came in handy for a few people.)

2) Ugly, overly verbose (and even error-prone?) UI.

If `git sparse-checkout set --in-tree <file>` means that any
subsequent checkout will blow apart any user-ran `git sparse-checkout
add <dir>`, then we really need to give at least warning and maybe
even error messages if people are using in-tree mode and use either
set or add.  Also, this feels like we're just about deprecating the
usage of set and add without the --in-tree flag, which could lead to
years of questions of "Why do I have to specify --in-tree every time I
run these commands?"  Can we avoid that somehow?

Maybe we just take advantage of the huge backward compatibility
warning we put in place and require non-in-tree mode to specify an
additional flag?  Maybe we just check the argument to set or add when
in cone mode, and if it's a file (instead of a directory) then we
automatically assume --in-tree (or attempt to parse it first and throw
an error if it isn't in the --in-tree format)?  Maybe we pick a
different verb than add or set ('use') and guide people to it instead
of `add` and `set`?

Also, as a side note, can people specify multiple in-tree files at
once (`git sparse-checkout set --in-tree <file1> <file2>`)?

Elijah Newren June 17, 2020, 11:14 p.m. UTC | #2

Hi,

Another late addition...

On Thu, May 7, 2020 at 6:20 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:

> IN-TREE SPARSE-CHECKOUT DEFINITIONS
> ===================================
>
> Minh's idea was simple: have sparse-checkout files in the working directory
> and use config to point to them. As these in-tree files update, we can
> automatically update the sparse-checkout definition accordingly. Now, the
> only thing to do would be to ensure that the sparse-checkout files are
> updated when someone updates the build definitions. This requires some extra
> build validation, but would not require special tools built on every client.

"In-tree" still bugs me after a few weeks; the wording seems slightly
awkward.  I don't have a good suggestion, but I'm curious if there's a
better term.

But I really came here to comment on another issue I think I glossed
over the first time around.  I'm curious if all module definition
files have to exist in the working directory, as possibly suggested
above, or if we can allow them to just exist in the index.  To give
you a flavor for what I mean, with my sparsify tool people can do
things like:
    ./sparsify --modules MODULE_A
which provides MODULE_A and it's dependencies while removing all other
directories.  If MODULE_B, is not a dependency (direct or transitive)
of MODULE_A, it will not exist in the working directory after this
step.  Our equivalent of the "in-tree" definition of MODULE_B exists
*in* the directory for MODULE_B, because it seems to make sense for
us.  I want people to be able to do
    ./sparsify --modules MODULE_B
and have it correctly check out all the necessary files even though
the definition of MODULE_B wasn't even in the working directory at the
time the command ran.  (The sparsify script knows to check the working
directory first, then fall back to the index).

Derrick Stolee June 18, 2020, 1:42 a.m. UTC | #3

On 6/17/2020 7:14 PM, Elijah Newren wrote:
> Hi,
> 
> Another late addition...
> 
> On Thu, May 7, 2020 at 6:20 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> 
>> IN-TREE SPARSE-CHECKOUT DEFINITIONS
>> ===================================
>>
>> Minh's idea was simple: have sparse-checkout files in the working directory
>> and use config to point to them. As these in-tree files update, we can
>> automatically update the sparse-checkout definition accordingly. Now, the
>> only thing to do would be to ensure that the sparse-checkout files are
>> updated when someone updates the build definitions. This requires some extra
>> build validation, but would not require special tools built on every client.
> 
> "In-tree" still bugs me after a few weeks; the wording seems slightly
> awkward.  I don't have a good suggestion, but I'm curious if there's a
> better term.

I am open to suggestions. It reminds me of the two hardest problems
in software engineering:

	1. concurrency
	2. naming things
	3. off-by-one errors

> But I really came here to comment on another issue I think I glossed
> over the first time around.  I'm curious if all module definition
> files have to exist in the working directory, as possibly suggested
> above, or if we can allow them to just exist in the index.  To give
> you a flavor for what I mean, with my sparsify tool people can do
> things like:
>     ./sparsify --modules MODULE_A
> which provides MODULE_A and it's dependencies while removing all other
> directories.  If MODULE_B, is not a dependency (direct or transitive)
> of MODULE_A, it will not exist in the working directory after this
> step.  Our equivalent of the "in-tree" definition of MODULE_B exists
> *in* the directory for MODULE_B, because it seems to make sense for
> us.  I want people to be able to do
>     ./sparsify --modules MODULE_B
> and have it correctly check out all the necessary files even though
> the definition of MODULE_B wasn't even in the working directory at the
> time the command ran.  (The sparsify script knows to check the working
> directory first, then fall back to the index).

I think one tricky part of my RFC is that it _only_ looks at the
index. This allows us to read the contents even when the files are
not part of the current sparse-checkout definition.

You mentioned in another thread that it is a bit unwieldy for a user
to rely on a committed (or staged?) file, so adding the ability to
check the working directory first is interesting. I wonder how the
timing comes into play when changing HEAD to a new commit? Seems
tricky, but solvable.

Thanks,
-Stolee

Elijah Newren June 18, 2020, 1:59 a.m. UTC | #4

On Wed, Jun 17, 2020 at 6:42 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/17/2020 7:14 PM, Elijah Newren wrote:
> > Hi,
> >
> > Another late addition...
> >
> > On Thu, May 7, 2020 at 6:20 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >
> >> IN-TREE SPARSE-CHECKOUT DEFINITIONS
> >> ===================================
> >>
> >> Minh's idea was simple: have sparse-checkout files in the working directory
> >> and use config to point to them. As these in-tree files update, we can
> >> automatically update the sparse-checkout definition accordingly. Now, the
> >> only thing to do would be to ensure that the sparse-checkout files are
> >> updated when someone updates the build definitions. This requires some extra
> >> build validation, but would not require special tools built on every client.
> >
> > "In-tree" still bugs me after a few weeks; the wording seems slightly
> > awkward.  I don't have a good suggestion, but I'm curious if there's a
> > better term.
>
> I am open to suggestions. It reminds me of the two hardest problems
> in software engineering:
>
>         1. concurrency
>         2. naming things
>         3. off-by-one errors

:-)

> > But I really came here to comment on another issue I think I glossed
> > over the first time around.  I'm curious if all module definition
> > files have to exist in the working directory, as possibly suggested
> > above, or if we can allow them to just exist in the index.  To give
> > you a flavor for what I mean, with my sparsify tool people can do
> > things like:
> >     ./sparsify --modules MODULE_A
> > which provides MODULE_A and it's dependencies while removing all other
> > directories.  If MODULE_B, is not a dependency (direct or transitive)
> > of MODULE_A, it will not exist in the working directory after this
> > step.  Our equivalent of the "in-tree" definition of MODULE_B exists
> > *in* the directory for MODULE_B, because it seems to make sense for
> > us.  I want people to be able to do
> >     ./sparsify --modules MODULE_B
> > and have it correctly check out all the necessary files even though
> > the definition of MODULE_B wasn't even in the working directory at the
> > time the command ran.  (The sparsify script knows to check the working
> > directory first, then fall back to the index).
>
> I think one tricky part of my RFC is that it _only_ looks at the
> index. This allows us to read the contents even when the files are
> not part of the current sparse-checkout definition.
>
> You mentioned in another thread that it is a bit unwieldy for a user
> to rely on a committed (or staged?) file, so adding the ability to
> check the working directory first is interesting. I wonder how the
> timing comes into play when changing HEAD to a new commit? Seems
> tricky, but solvable.

Isn't that essentially the same timing issue that comes into play if
you only look at the index, and are changing HEAD to a new commit?

Derrick Stolee June 18, 2020, 3:01 a.m. UTC | #5

On 6/17/2020 9:59 PM, Elijah Newren wrote:
> On Wed, Jun 17, 2020 at 6:42 PM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 6/17/2020 7:14 PM, Elijah Newren wrote:
>> You mentioned in another thread that it is a bit unwieldy for a user
>> to rely on a committed (or staged?) file, so adding the ability to
>> check the working directory first is interesting. I wonder how the
>> timing comes into play when changing HEAD to a new commit? Seems
>> tricky, but solvable.
> 
> Isn't that essentially the same timing issue that comes into play if
> you only look at the index, and are changing HEAD to a new commit?

It adds to the complexity. We can inspect the new index for the
in-tree definition and update the skip-worktree bits before actually
changing the working directory to match the new index. However, if
we trust the working directory copy over the index copy, then we
need to also decide if the working directory copy _would_ change
in this move before using it.

Of course, maybe I'm making it over-complicated. I still know so
little about the index. I got into this feature due to a simple
pattern-matching problem that I could understand, but now I'm
getting lost in index states and dirty statuses. ;)

Thanks,
-Stolee

Elijah Newren June 18, 2020, 5:03 a.m. UTC | #6

On Wed, Jun 17, 2020 at 8:01 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/17/2020 9:59 PM, Elijah Newren wrote:
> > On Wed, Jun 17, 2020 at 6:42 PM Derrick Stolee <stolee@gmail.com> wrote:
> >>
> >> On 6/17/2020 7:14 PM, Elijah Newren wrote:
> >> You mentioned in another thread that it is a bit unwieldy for a user
> >> to rely on a committed (or staged?) file, so adding the ability to
> >> check the working directory first is interesting. I wonder how the
> >> timing comes into play when changing HEAD to a new commit? Seems
> >> tricky, but solvable.
> >
> > Isn't that essentially the same timing issue that comes into play if
> > you only look at the index, and are changing HEAD to a new commit?
>
> It adds to the complexity. We can inspect the new index for the
> in-tree definition and update the skip-worktree bits before actually
> changing the working directory to match the new index. However, if
> we trust the working directory copy over the index copy, then we
> need to also decide if the working directory copy _would_ change
> in this move before using it.
>
> Of course, maybe I'm making it over-complicated. I still know so
> little about the index. I got into this feature due to a simple
> pattern-matching problem that I could understand, but now I'm
> getting lost in index states and dirty statuses. ;)

Honestly, I think your first cut for switching branches with this new
feature should just be:
  1) Do a switch/checkout exactly the way it's done today already:
    1a) Load the index and existing sparsity rules (from worktree then
index), according to what existed before the branch switch
    1b) Do the appropriate 1-way or 2-way merge from unpack_trees() to
move to the right branch (as builtin/checkout.c already does)
  2) *If* using in-tree sparsity feature, re-load the sparsity rules
(from worktree then index) and run the equivalent of `sparse-checkout
reapply`

Not only does this avoid messing up any code for the normal case, I'm
pretty sure this gets the behavior the user expects in all the special
cases.  There is one big downside, and a little downside.  The big
downside is that it'll have two progress meters instead of just one
(much like sparse-checkout init && sparse-checkout set do today).  The
little downside is that this isn't optimal from a performance
standpoint, for two reasons:
(a) some files may be updated in the working tree in step 1b despite
the fact that they'll be removed in step 2.
(b) step 2 may just be an expensive no-op, in fact I suspect it will
be most the time

The reason I think the performance isn't a big deal is because:
C) the fact that (b) is a problem means (a) is not often an issue --
the sparsity patterns are usually the same.
D) Even if the sparsity patterns differ, it's often the case that
files within the tree won't change (people tend to only modify a few
files per commit, after all, so even switching branches only updates a
subset of all files).
E) Even if the sparsity patterns differ and files differ and have to
be updated, it's still proportional at most to the number of files
selected by the sparsity patterns, so it won't feel excessive or slow
to the user.
F) `sparse-checkout reapply` is pretty cheap, especially if called as
a function instead of as a subprocess

In my opinion, other than the unfortunate dual progress meters, the
only place where this new feature gets hairy is that `git
sparse-checkout reapply` (or its libified variant) now needs to know
how to "read your new form of sparsity rules".  That sounds easy at
first but in my opinion it is a bit of a mess.  The user could have
written garbage to the file and I don't know what to do with garbage,
but I think we have to do something sane.  (Give up and checkout
everything?  Give up and checkout nothing but the toplevel directory?
Ignore each line we don't understand and treat the rest as stuff that
needs to be checked out?  Can that last one even be done with the .ini
reader?)  I don't think saying "well, tough luck the user should know
better" works because we already know that merges or rebases or
checkout -m could have written garbage to those files (in the form of
conflict markers), and users will say that *we* were the ones that
filled the file with garbage.  So we have to handle it intelligently
in some fashion.

Thoughts?

[00/10,RFC] In-tree sparse-checkout definitions

Message

Comments