Message ID | pull.627.git.1588857462.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | In-tree sparse-checkout definitions | expand |
On Thu, May 7, 2020 at 6:20 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > > This is based on ds/sparse-allow-empty-working-tree. > > BACKGROUND > ========== > > After discussing the future of sparse-checkout at the Git Contributor > Summit, Minh Tai caught me during the break to introduce a very interesting > idea: in-tree sparse-checkout definitions. > > Here is the context: I was talking about how the Office team is building a > tool to update the sparse-checkout definition dynamically as their build > definitions change. They have a very compartmentalized build system, and > certain "projects" depend on each other. To ensure that builds do not break > as dependencies change, the plan was to have this tool run as a hook when > HEAD changes. This is a bit expensive to do, and we were discussing how much > we need this to be automated versus a part of the build process. > > This kind of idea would need to be replicated for every team that wants to > use sparse-checkout in this way. That doesn't bode well for wide adoption of > the feature. > > IN-TREE SPARSE-CHECKOUT DEFINITIONS > =================================== > > Minh's idea was simple: have sparse-checkout files in the working directory > and use config to point to them. As these in-tree files update, we can > automatically update the sparse-checkout definition accordingly. Now, the > only thing to do would be to ensure that the sparse-checkout files are > updated when someone updates the build definitions. This requires some extra > build validation, but would not require special tools built on every client. I'm not sure how well having a parallel dependency structure will be received, but the idea of auto-updating does seem really nice. (I put a hook in the build system that compares the timestamps on the files that define dependency structures to the .git/info/sparse-checkout file, and re-run `git sparse-checkout reapply`, which helps with updating but is still delayed from the checkout/merge/rebase that brought in new dependencies.) > As soon as I sat down to implement this idea, I discovered some > complications to that basic idea. Each of these sparse-checkout definitions > should fit a "role" on the team. To continue with the Office analogy, > suppose we had a definition for developers working on each of Word, > PowerPoint, and Excel. What happens when someone wants to port a feature > from Excel to Word? That user would want to take the union of the two > sparse-checkout definitions. But what does that mean when we are working > with arbitrary sparse-checkout files? > > This leads to the restriction that I built into this feature: we only care > about "cone mode" definitions. These rely on directory-based prefix matches > that are very fast. With this restriction, it is simple to understand what > the "union" operation should do: take all directories that would be included > by either. Seems like you're adopting my idea of pushing folks towards "cone mode" and limiting support for non-code modes[1], eh? ;-) I'm totally on board. :-) [1] https://lore.kernel.org/git/CABPp-BHKYR9QBcAG_pM6DniaeGS40X=ErKGGsvWa_KhogCZzEA@mail.gmail.com/ > Once we restrict to cone mode, there is no reason to continue storing files > using the sparse-checkout file format. The list of patterns is larger than > the input to "git sparse-checkout set" or output from "git sparse-checkout > list" in cone mode, and more prone to user error. Thus, let's be simpler and > use a list of directories to specify the sparse-checkout definition. Ahah, you did come around to using a new config file[2]. ;-) I like having simpler config files; sounds great. [2] https://lore.kernel.org/git/c919513a-f41f-2ce8-60ed-e0b0733c0c7f@gmail.com/ ("another config file") > This leads to the second issue that complicates things. If a build > dependency changes to a core library, but there are N "roles" that depend on > that library, then the person changing that library also needs to change and > validate N sparse-checkout definitions! This does not scale well. So, let's > take a page from build dependencies and allow these in-tree sparse-checkout > definitions to depend on each other! > > The end result is a mechanism that should be flexible enough to future needs > but simple enough to build using existing Git features. The format re-uses > the config file format. This provides good re-use of code as well as being > something easy to extend in the future. In Patch 4, I tried to describe > several alternatives and why this is the best of those options. I'm open to > suggestions here. > > OUTLINE > ======= > > * Patch 1 is a bugfix that should work on its own. I caught it in tests > after Patch 5. > > > * Patches 2-7 implement the feature. > > > * Patches 8-9 make the Git build system a bit more robust to missing > directories. > > > * Patch 10 adds a .sparse directory to Git as well as two in-tree > sparse-checkout definitions. One is for a bare-bones Linux developer and > the other adds the compat code for Windows on top. > > > > As mentioned in patch 10, this allows the following workflow for Git > contributors that want to "eat our own dogfood" with the partial clone and > sparse-checkout features: > > $ git clone --sparse --filter=blob:none https://github.com/git/git > $ cd git > $ git sparse-checkout set --in-tree .sparse/base.deps > > Then, we have a smaller working directory but can still build and test the > code. > > FUTURE DIRECTIONS > ================= > > There are definitely ways to make this feature more seamless, or more useful > to us. > > * It would be nice to extend git clone to have a string value to > --sparse=<file> and possibly have it multi-valued. This would initialize > the in-tree sparse-checkout definition immediately upon cloning. Also in terms of future directions, I would like to have the option for these to be passed to --filter instead of using --filter:none. And on top of that, it may also make sense for users to be able to fetch cones of history somehow. For example, if they started with `git clone --sparse=featureA ...` and later wanted to add the 'featureB' cone, it'd be nice to be able to run some kind of `git fetch --sparse=featureB ...` command when they know they are on a good network to download the history of objects that touched the featureB cone instead of relying on the normal lazy fetching of partial clones. This kind of thing could be especially useful if, say, we were in a pandemic and everyone was forced to work from home. But it'd also be useful if you just had people travelling a lot and stuck in third world places with low connectivity, like CalTrain. ;-) > * I think there are some ways we could reorganize the Git codebase to make > our "most basic" sparse-checkout file have even fewer files. Mostly, it > would require modifying the build to ask "do these files exist on-disk? > if not, then disable this build option." This would also require adding > test dependencies that also disable some tests when those do not exist. > > > * Currently, if we have an in-tree sparse-checkout definition, then we > can't add more directories to it manually. If someone runs git > sparse-checkout set --in-tree <file> followed by git sparse-checkout add > <dir>, then will disappear as it is overridden by the in-tree definition. > The only way I can think to get around this would be to add these > directories through Git config, since we were using the sparse-checkout > file as the storage of "what directories do I care about?" but it is no > longer the source of truth in this situation. I'm open to ideas here, and > it would be nice to have this interaction hardened before moving the > series out of RFC status. This seems like a big problem to me. Actually, it sounds like two big problems. 1) non-extensibility I suspect most will be happy enough with the in-tree definitions, but there are always a few users that want to add to the pre-defined sparsity sets to get some additional directories. We could use .gitignore as an example; although people have shared ignore rules in .gitignore, we still allow user-defined additional ignore rules that are not meant to be shared. Perhaps we can allow users to use some pre-defined additional file (similar to .git/info/exclude) or even user-defined additional file (similar to core.excludesFile)? We would still need to be careful to not allow the in-tree definitions to depend on files that are not in-tree, of course. (Even with my `sparsify` script which allowed things like `./sparsify --modules moduleA moduleB` and which walked build system dependencies to get the right directories, I still allowed additional directories to be specified within a .git/info/sparse-checkout-user file. Not sure I liked the location of the file (it's sometimes hard to find for users of worktrees), or the basename of the file I randomly chose that day, but it came in handy for a few people.) 2) Ugly, overly verbose (and even error-prone?) UI. If `git sparse-checkout set --in-tree <file>` means that any subsequent checkout will blow apart any user-ran `git sparse-checkout add <dir>`, then we really need to give at least warning and maybe even error messages if people are using in-tree mode and use either set or add. Also, this feels like we're just about deprecating the usage of set and add without the --in-tree flag, which could lead to years of questions of "Why do I have to specify --in-tree every time I run these commands?" Can we avoid that somehow? Maybe we just take advantage of the huge backward compatibility warning we put in place and require non-in-tree mode to specify an additional flag? Maybe we just check the argument to set or add when in cone mode, and if it's a file (instead of a directory) then we automatically assume --in-tree (or attempt to parse it first and throw an error if it isn't in the --in-tree format)? Maybe we pick a different verb than add or set ('use') and guide people to it instead of `add` and `set`? Also, as a side note, can people specify multiple in-tree files at once (`git sparse-checkout set --in-tree <file1> <file2>`)?
Hi, Another late addition... On Thu, May 7, 2020 at 6:20 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > IN-TREE SPARSE-CHECKOUT DEFINITIONS > =================================== > > Minh's idea was simple: have sparse-checkout files in the working directory > and use config to point to them. As these in-tree files update, we can > automatically update the sparse-checkout definition accordingly. Now, the > only thing to do would be to ensure that the sparse-checkout files are > updated when someone updates the build definitions. This requires some extra > build validation, but would not require special tools built on every client. "In-tree" still bugs me after a few weeks; the wording seems slightly awkward. I don't have a good suggestion, but I'm curious if there's a better term. But I really came here to comment on another issue I think I glossed over the first time around. I'm curious if all module definition files have to exist in the working directory, as possibly suggested above, or if we can allow them to just exist in the index. To give you a flavor for what I mean, with my sparsify tool people can do things like: ./sparsify --modules MODULE_A which provides MODULE_A and it's dependencies while removing all other directories. If MODULE_B, is not a dependency (direct or transitive) of MODULE_A, it will not exist in the working directory after this step. Our equivalent of the "in-tree" definition of MODULE_B exists *in* the directory for MODULE_B, because it seems to make sense for us. I want people to be able to do ./sparsify --modules MODULE_B and have it correctly check out all the necessary files even though the definition of MODULE_B wasn't even in the working directory at the time the command ran. (The sparsify script knows to check the working directory first, then fall back to the index).
On 6/17/2020 7:14 PM, Elijah Newren wrote: > Hi, > > Another late addition... > > On Thu, May 7, 2020 at 6:20 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: > >> IN-TREE SPARSE-CHECKOUT DEFINITIONS >> =================================== >> >> Minh's idea was simple: have sparse-checkout files in the working directory >> and use config to point to them. As these in-tree files update, we can >> automatically update the sparse-checkout definition accordingly. Now, the >> only thing to do would be to ensure that the sparse-checkout files are >> updated when someone updates the build definitions. This requires some extra >> build validation, but would not require special tools built on every client. > > "In-tree" still bugs me after a few weeks; the wording seems slightly > awkward. I don't have a good suggestion, but I'm curious if there's a > better term. I am open to suggestions. It reminds me of the two hardest problems in software engineering: 1. concurrency 2. naming things 3. off-by-one errors > But I really came here to comment on another issue I think I glossed > over the first time around. I'm curious if all module definition > files have to exist in the working directory, as possibly suggested > above, or if we can allow them to just exist in the index. To give > you a flavor for what I mean, with my sparsify tool people can do > things like: > ./sparsify --modules MODULE_A > which provides MODULE_A and it's dependencies while removing all other > directories. If MODULE_B, is not a dependency (direct or transitive) > of MODULE_A, it will not exist in the working directory after this > step. Our equivalent of the "in-tree" definition of MODULE_B exists > *in* the directory for MODULE_B, because it seems to make sense for > us. I want people to be able to do > ./sparsify --modules MODULE_B > and have it correctly check out all the necessary files even though > the definition of MODULE_B wasn't even in the working directory at the > time the command ran. (The sparsify script knows to check the working > directory first, then fall back to the index). I think one tricky part of my RFC is that it _only_ looks at the index. This allows us to read the contents even when the files are not part of the current sparse-checkout definition. You mentioned in another thread that it is a bit unwieldy for a user to rely on a committed (or staged?) file, so adding the ability to check the working directory first is interesting. I wonder how the timing comes into play when changing HEAD to a new commit? Seems tricky, but solvable. Thanks, -Stolee
On Wed, Jun 17, 2020 at 6:42 PM Derrick Stolee <stolee@gmail.com> wrote: > > On 6/17/2020 7:14 PM, Elijah Newren wrote: > > Hi, > > > > Another late addition... > > > > On Thu, May 7, 2020 at 6:20 AM Derrick Stolee via GitGitGadget > > <gitgitgadget@gmail.com> wrote: > > > >> IN-TREE SPARSE-CHECKOUT DEFINITIONS > >> =================================== > >> > >> Minh's idea was simple: have sparse-checkout files in the working directory > >> and use config to point to them. As these in-tree files update, we can > >> automatically update the sparse-checkout definition accordingly. Now, the > >> only thing to do would be to ensure that the sparse-checkout files are > >> updated when someone updates the build definitions. This requires some extra > >> build validation, but would not require special tools built on every client. > > > > "In-tree" still bugs me after a few weeks; the wording seems slightly > > awkward. I don't have a good suggestion, but I'm curious if there's a > > better term. > > I am open to suggestions. It reminds me of the two hardest problems > in software engineering: > > 1. concurrency > 2. naming things > 3. off-by-one errors :-) > > But I really came here to comment on another issue I think I glossed > > over the first time around. I'm curious if all module definition > > files have to exist in the working directory, as possibly suggested > > above, or if we can allow them to just exist in the index. To give > > you a flavor for what I mean, with my sparsify tool people can do > > things like: > > ./sparsify --modules MODULE_A > > which provides MODULE_A and it's dependencies while removing all other > > directories. If MODULE_B, is not a dependency (direct or transitive) > > of MODULE_A, it will not exist in the working directory after this > > step. Our equivalent of the "in-tree" definition of MODULE_B exists > > *in* the directory for MODULE_B, because it seems to make sense for > > us. I want people to be able to do > > ./sparsify --modules MODULE_B > > and have it correctly check out all the necessary files even though > > the definition of MODULE_B wasn't even in the working directory at the > > time the command ran. (The sparsify script knows to check the working > > directory first, then fall back to the index). > > I think one tricky part of my RFC is that it _only_ looks at the > index. This allows us to read the contents even when the files are > not part of the current sparse-checkout definition. > > You mentioned in another thread that it is a bit unwieldy for a user > to rely on a committed (or staged?) file, so adding the ability to > check the working directory first is interesting. I wonder how the > timing comes into play when changing HEAD to a new commit? Seems > tricky, but solvable. Isn't that essentially the same timing issue that comes into play if you only look at the index, and are changing HEAD to a new commit?
On 6/17/2020 9:59 PM, Elijah Newren wrote: > On Wed, Jun 17, 2020 at 6:42 PM Derrick Stolee <stolee@gmail.com> wrote: >> >> On 6/17/2020 7:14 PM, Elijah Newren wrote: >> You mentioned in another thread that it is a bit unwieldy for a user >> to rely on a committed (or staged?) file, so adding the ability to >> check the working directory first is interesting. I wonder how the >> timing comes into play when changing HEAD to a new commit? Seems >> tricky, but solvable. > > Isn't that essentially the same timing issue that comes into play if > you only look at the index, and are changing HEAD to a new commit? It adds to the complexity. We can inspect the new index for the in-tree definition and update the skip-worktree bits before actually changing the working directory to match the new index. However, if we trust the working directory copy over the index copy, then we need to also decide if the working directory copy _would_ change in this move before using it. Of course, maybe I'm making it over-complicated. I still know so little about the index. I got into this feature due to a simple pattern-matching problem that I could understand, but now I'm getting lost in index states and dirty statuses. ;) Thanks, -Stolee
On Wed, Jun 17, 2020 at 8:01 PM Derrick Stolee <stolee@gmail.com> wrote: > > On 6/17/2020 9:59 PM, Elijah Newren wrote: > > On Wed, Jun 17, 2020 at 6:42 PM Derrick Stolee <stolee@gmail.com> wrote: > >> > >> On 6/17/2020 7:14 PM, Elijah Newren wrote: > >> You mentioned in another thread that it is a bit unwieldy for a user > >> to rely on a committed (or staged?) file, so adding the ability to > >> check the working directory first is interesting. I wonder how the > >> timing comes into play when changing HEAD to a new commit? Seems > >> tricky, but solvable. > > > > Isn't that essentially the same timing issue that comes into play if > > you only look at the index, and are changing HEAD to a new commit? > > It adds to the complexity. We can inspect the new index for the > in-tree definition and update the skip-worktree bits before actually > changing the working directory to match the new index. However, if > we trust the working directory copy over the index copy, then we > need to also decide if the working directory copy _would_ change > in this move before using it. > > Of course, maybe I'm making it over-complicated. I still know so > little about the index. I got into this feature due to a simple > pattern-matching problem that I could understand, but now I'm > getting lost in index states and dirty statuses. ;) Honestly, I think your first cut for switching branches with this new feature should just be: 1) Do a switch/checkout exactly the way it's done today already: 1a) Load the index and existing sparsity rules (from worktree then index), according to what existed before the branch switch 1b) Do the appropriate 1-way or 2-way merge from unpack_trees() to move to the right branch (as builtin/checkout.c already does) 2) *If* using in-tree sparsity feature, re-load the sparsity rules (from worktree then index) and run the equivalent of `sparse-checkout reapply` Not only does this avoid messing up any code for the normal case, I'm pretty sure this gets the behavior the user expects in all the special cases. There is one big downside, and a little downside. The big downside is that it'll have two progress meters instead of just one (much like sparse-checkout init && sparse-checkout set do today). The little downside is that this isn't optimal from a performance standpoint, for two reasons: (a) some files may be updated in the working tree in step 1b despite the fact that they'll be removed in step 2. (b) step 2 may just be an expensive no-op, in fact I suspect it will be most the time The reason I think the performance isn't a big deal is because: C) the fact that (b) is a problem means (a) is not often an issue -- the sparsity patterns are usually the same. D) Even if the sparsity patterns differ, it's often the case that files within the tree won't change (people tend to only modify a few files per commit, after all, so even switching branches only updates a subset of all files). E) Even if the sparsity patterns differ and files differ and have to be updated, it's still proportional at most to the number of files selected by the sparsity patterns, so it won't feel excessive or slow to the user. F) `sparse-checkout reapply` is pretty cheap, especially if called as a function instead of as a subprocess In my opinion, other than the unfortunate dual progress meters, the only place where this new feature gets hairy is that `git sparse-checkout reapply` (or its libified variant) now needs to know how to "read your new form of sparsity rules". That sounds easy at first but in my opinion it is a bit of a mess. The user could have written garbage to the file and I don't know what to do with garbage, but I think we have to do something sane. (Give up and checkout everything? Give up and checkout nothing but the toplevel directory? Ignore each line we don't understand and treat the rest as stuff that needs to be checked out? Can that last one even be done with the .ini reader?) I don't think saying "well, tough luck the user should know better" works because we already know that merges or rebases or checkout -m could have written garbage to those files (in the form of conflict markers), and users will say that *we* were the ones that filled the file with garbage. So we have to handle it intelligently in some fashion. Thoughts?