[RFC,v1] Add virtual file system settings and hook proc
diff mbox series

Message ID 20181030191608.18716-1-peartben@gmail.com
State New
Headers show
Series
  • [RFC,v1] Add virtual file system settings and hook proc
Related show

Commit Message

Ben Peart Oct. 30, 2018, 7:16 p.m. UTC
From: Ben Peart <benpeart@microsoft.com>

On index load, clear/set the skip worktree bits based on the virtual
file system data. Use virtual file system data to update skip-worktree
bit in unpack-trees. Use virtual file system data to exclude files and
folders not explicitly requested.

Signed-off-by: Ben Peart <benpeart@microsoft.com>
---

We have taken several steps to make git perform well on very large repos.
Some of those steps include: improving underlying algorithms, utilizing
multi-threading where possible, and simplifying the behavior of some commands.
These changes typically benefit all git repos to varying degrees.  While
these optimizations all help, they are insufficient to provide adequate
performance on the very large repos we often work with.

To make git perform well on the very largest repos, we had to make more
significant changes.  The biggest performance win by far is the work we have
done to make git operations O(modified) instead of O(size of repo).  This
takes advantage of the fact that the number of files a developer has modified
is a tiny fraction of the overall repo size.

We accomplished this by utilizing the existing internal logic for the skip
worktree bit and excludes to tell git to ignore all files and folders other
than those that have been modified.  This logic is driven by an external
process that monitors writes to the repo and communicates the list of files
and folders with changes to git via the virtual file system hook in this patch.

The external process maintains a list of files and folders that have been
modified.  When git runs, it requests the list of files and folders that
have been modified via the virtual file system hook.  Git then sets/clears
the skip-worktree bit on the cache entries and builds a hashmap of the
modified files/folders that is used by the excludes logic to avoid scanning
the entire repo looking for changes and untracked files.

With this system, we have been able to make local git command performance on
extremely large repos (millions of files, 1/2 million folders) entirely
manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second
commit, etc).

Our desire is to eliminate all custom patches in our fork of git.  To that
end, I'm submitting this as an RFC to see how much interest there is and how
much willingness to take this type of change into git.

Notes:
    Base Ref: master
    Web-Diff: https://github.com/benpeart/git/commit/c06b290d2f
    Checkout: git fetch https://github.com/benpeart/git virtual-filesystem-v1 && git checkout c06b290d2f

 Documentation/config.txt     |   8 +
 Documentation/githooks.txt   |  20 ++
 Makefile                     |   1 +
 cache.h                      |   1 +
 config.c                     |  37 +++-
 config.h                     |   1 +
 dir.c                        |  34 +++-
 environment.c                |   1 +
 read-cache.c                 |   2 +
 t/README                     |   3 +
 t/t1092-virtualfilesystem.sh | 354 +++++++++++++++++++++++++++++++++++
 t/t1092/virtualfilesystem    |  23 +++
 unpack-trees.c               |  23 ++-
 virtualfilesystem.c          | 308 ++++++++++++++++++++++++++++++
 virtualfilesystem.h          |  25 +++
 15 files changed, 833 insertions(+), 8 deletions(-)
 create mode 100755 t/t1092-virtualfilesystem.sh
 create mode 100755 t/t1092/virtualfilesystem
 create mode 100644 virtualfilesystem.c
 create mode 100644 virtualfilesystem.h


base-commit: 4ede3d42dfb57f9a41ac96a1f216c62eb7566cc2

Comments

Junio C Hamano Oct. 30, 2018, 11:07 p.m. UTC | #1
Ben Peart <peartben@gmail.com> writes:

> diff --git a/config.c b/config.c
> index 4051e38823..96e05ee0f1 100644
> --- a/config.c
> +++ b/config.c
> ...
> @@ -2307,6 +2311,37 @@ int git_config_get_index_threads(void)
>  	return 0; /* auto */
>  }
>  
> +int git_config_get_virtualfilesystem(void)
> +{
> +	if (git_config_get_pathname("core.virtualfilesystem", &core_virtualfilesystem))
> +		core_virtualfilesystem = getenv("GIT_TEST_VIRTUALFILESYSTEM");
> +
> +	if (core_virtualfilesystem && !*core_virtualfilesystem)
> +		core_virtualfilesystem = NULL;
> +
> +	if (core_virtualfilesystem) {
> +		/*
> +		 * Some git commands spawn helpers and redirect the index to a different
> +		 * location.  These include "difftool -d" and the sequencer
> +		 * (i.e. `git rebase -i`, `git cherry-pick` and `git revert`) and others.
> +		 * In those instances we don't want to update their temporary index with
> +		 * our virtualization data.
> +		 */
> +		char *default_index_file = xstrfmt("%s/%s", the_repository->gitdir, "index");
> +		int should_run_hook = !strcmp(default_index_file, the_repository->index_file);
> +
> +		free(default_index_file);
> +		if (should_run_hook) {
> +			/* virtual file system relies on the sparse checkout logic so force it on */
> +			core_apply_sparse_checkout = 1;
> +			return 1;
> +		}
> +		core_virtualfilesystem = NULL;

It would be a small leak if this came from config_get_pathname(),
but if it came from $GIT_TEST_VFS env, we cannot free it X-<.

A helper function called *_get_X() that does not return X as its
return value or updating the location pointed by its *dst parameter,
and instead only stores its finding in a global variable feels
somewhat odd.  It smells more like "find out", "probe", "check",
etc.

> diff --git a/dir.c b/dir.c
> index 47c2fca8dc..3097b0e446 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -21,6 +21,7 @@
>  #include "ewah/ewok.h"
>  #include "fsmonitor.h"
>  #include "submodule-config.h"
> +#include "virtualfilesystem.h"
>  
>  /*
>   * Tells read_directory_recursive how a file or directory should be treated.
> @@ -1109,6 +1110,18 @@ int is_excluded_from_list(const char *pathname,
>  			  struct exclude_list *el, struct index_state *istate)
>  {
>  	struct exclude *exclude;
> +
> +	/*
> +	 * The virtual file system data is used to prevent git from traversing
> +	 * any part of the tree that is not in the virtual file system.  Return
> +	 * 1 to exclude the entry if it is not found in the virtual file system,
> +	 * else fall through to the regular excludes logic as it may further exclude.
> +	 */

This comment will sit better immediately in front of the call to "is
excluded from vfs?" helper function.

> +	if (*dtype == DT_UNKNOWN)
> +		*dtype = get_dtype(NULL, istate, pathname, pathlen);

We try to defer paying cost to determine unknown *dtype as late as
possible by having this call in last_exclude_matching_from_list(),
and not here.  If we are doing this, we probably should update the
callpaths that call last_exclude_matching_from_list() to make the
caller responsible for doing get_dtype() and drop the lazy finding
of dtype from the callee.  Alternatively, the new "is excluded from
vfs" helper can learn to do the lazy get_dtype() just like the
existing last_exclude_matching_from_list() does.  I suspect the
latter may be simpler.

> +	if (is_excluded_from_virtualfilesystem(pathname, pathlen, *dtype) > 0)
> +		return 1;
> +
>  	exclude = last_exclude_matching_from_list(pathname, pathlen, basename,
>  						  dtype, el, istate);
>  	if (exclude)
> @@ -1324,8 +1337,20 @@ struct exclude *last_exclude_matching(struct dir_struct *dir,
>  int is_excluded(struct dir_struct *dir, struct index_state *istate,
>  		const char *pathname, int *dtype_p)
>  {
> -	struct exclude *exclude =
> -		last_exclude_matching(dir, istate, pathname, dtype_p);
> +	struct exclude *exclude;
> +
> +	/*
> +	 * The virtual file system data is used to prevent git from traversing
> +	 * any part of the tree that is not in the virtual file system.  Return
> +	 * 1 to exclude the entry if it is not found in the virtual file system,
> +	 * else fall through to the regular excludes logic as it may further exclude.
> +	 */
> +	if (*dtype_p == DT_UNKNOWN)
> +		*dtype_p = get_dtype(NULL, istate, pathname, strlen(pathname));

Exactly the same comment as above.

> +	if (is_excluded_from_virtualfilesystem(pathname, strlen(pathname), *dtype_p) > 0)
> +		return 1;
> +
> +	exclude = last_exclude_matching(dir, istate, pathname, dtype_p);
>  	if (exclude)
>  		return exclude->flags & EXC_FLAG_NEGATIVE ? 0 : 1;
>  	return 0;
> @@ -1678,6 +1703,9 @@ static enum path_treatment treat_one_path(struct dir_struct *dir,
>  	if (dtype != DT_DIR && has_path_in_index)
>  		return path_none;
>  
> +	if (is_excluded_from_virtualfilesystem(path->buf, path->len, dtype) > 0)
> +		return path_excluded;
> +
>  	/*
>  	 * When we are looking at a directory P in the working tree,
>  	 * there are three cases:
> @@ -2018,6 +2046,8 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
>  		/* add the path to the appropriate result list */
>  		switch (state) {
>  		case path_excluded:
> +			if (is_excluded_from_virtualfilesystem(path.buf, path.len, DT_DIR) > 0)
> +				break;
>  			if (dir->flags & DIR_SHOW_IGNORED)
>  				dir_add_name(dir, istate, path.buf, path.len);
>  			else if ((dir->flags & DIR_SHOW_IGNORED_TOO) ||

I am kind-of surprised that the "damage" to dir.c need to support
this is so isolated and small ;-)

> ...
> +if test "$1" != 1
> +then
> +	echo "Unsupported core.virtualfilesystem hook version." >&2
> +	exit 1
> +fi
> +
> +#find -type f -printf '%P\0'
> +find -type d -printf '%P/\0'

I am not reading (hence not commenting on) tests in this review
message yet.

> diff --git a/unpack-trees.c b/unpack-trees.c
> index 7570df481b..ee3cda2e94 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -18,6 +18,7 @@
>  #include "fsmonitor.h"
>  #include "object-store.h"
>  #include "fetch-object.h"
> +#include "virtualfilesystem.h"
>  
>  /*
>   * Error messages expected by scripts out of plumbing commands such as
> @@ -1363,6 +1364,14 @@ static int clear_ce_flags_1(struct index_state *istate,
>  			continue;
>  		}
>  
> +		/* if it's not in the virtual file system, exit early */
> +		if (core_virtualfilesystem) {
> +			if (is_included_in_virtualfilesystem(ce->name, ce->ce_namelen) > 0)
> +				ce->ce_flags &= ~clear_mask;
> +			cache++;
> +			continue;
> +		}

Earlier we saw "is it excluded?" and now we have "is it included?"
They have different function signature (i.e. "included?" does not
need to know the type of the entry), and I am guessing that for the
purpose of this particular patch that may be sufficient, but I have
to wonder if in the longer term we'd be better off to keep the
interface to these two functions similar.  Also, I wonder if we need
both---I see below that these "is included/excluded?" helpers are
allowed to say "I dunno", so that may be the reason why we cannot
simply say "included is the same as !excluded".
Duy Nguyen Oct. 31, 2018, 7:11 p.m. UTC | #2
not really a review, just  a couple quick notes..

On Tue, Oct 30, 2018 at 9:40 PM Ben Peart <peartben@gmail.com> wrote:
>
> From: Ben Peart <benpeart@microsoft.com>
>
> On index load, clear/set the skip worktree bits based on the virtual
> file system data. Use virtual file system data to update skip-worktree
> bit in unpack-trees. Use virtual file system data to exclude files and
> folders not explicitly requested.
>
> Signed-off-by: Ben Peart <benpeart@microsoft.com>
> ---
>
> We have taken several steps to make git perform well on very large repos.
> Some of those steps include: improving underlying algorithms, utilizing
> multi-threading where possible, and simplifying the behavior of some commands.
> These changes typically benefit all git repos to varying degrees.  While
> these optimizations all help, they are insufficient to provide adequate
> performance on the very large repos we often work with.
>
> To make git perform well on the very largest repos, we had to make more
> significant changes.  The biggest performance win by far is the work we have
> done to make git operations O(modified) instead of O(size of repo).  This
> takes advantage of the fact that the number of files a developer has modified
> is a tiny fraction of the overall repo size.
>
> We accomplished this by utilizing the existing internal logic for the skip
> worktree bit and excludes to tell git to ignore all files and folders other
> than those that have been modified.  This logic is driven by an external
> process that monitors writes to the repo and communicates the list of files
> and folders with changes to git via the virtual file system hook in this patch.
>
> The external process maintains a list of files and folders that have been
> modified.  When git runs, it requests the list of files and folders that
> have been modified via the virtual file system hook.  Git then sets/clears
> the skip-worktree bit on the cache entries and builds a hashmap of the
> modified files/folders that is used by the excludes logic to avoid scanning
> the entire repo looking for changes and untracked files.
>
> With this system, we have been able to make local git command performance on
> extremely large repos (millions of files, 1/2 million folders) entirely
> manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second
> commit, etc).
>
> Our desire is to eliminate all custom patches in our fork of git.  To that
> end, I'm submitting this as an RFC to see how much interest there is and how
> much willingness to take this type of change into git.

Most of these paragraphs (perhaps except the last one) should be part
of the commit message. You describe briefly what the patch does but
it's even more important to say why you want to do it.

> +core.virtualFilesystem::
> +       If set, the value of this variable is used as a command which
> +       will identify all files and directories that are present in
> +       the working directory.  Git will only track and update files
> +       listed in the virtual file system.  Using the virtual file system
> +       will supersede the sparse-checkout settings which will be ignored.
> +       See the "virtual file system" section of linkgit:githooks[6].

It sounds like "virtual file system" is just one of the use cases for
this feature, which is more about a dynamic source of sparse-checkout
bits. Perhaps name the config key with something along sparse checkout
instead of naming it after a use case.

This is a hook. I notice we start to avoid adding real hooks and just
add config keys instead. Eventually we should have config-based hooks,
but if we're going to add more like this, I think these should be in a
separate section, hook.virtualFileSystem or something.

I don't think the superseding makes sense. There's no reason this
could not be used in combination with $GIT_DIR/info/sparse-checkout.
If you don't want both, disable the other.

One last note. Since this is related to filesystem. Shouldn't it be
part of fsmonitor (the protocol, not the implementation)? Then
watchman user could use it to.
Ben Peart Oct. 31, 2018, 8:12 p.m. UTC | #3
On 10/30/2018 7:07 PM, Junio C Hamano wrote:
> Ben Peart <peartben@gmail.com> writes:
> 
>> diff --git a/config.c b/config.c
>> index 4051e38823..96e05ee0f1 100644
>> --- a/config.c
>> +++ b/config.c
>> ...
>> @@ -2307,6 +2311,37 @@ int git_config_get_index_threads(void)
>>   	return 0; /* auto */
>>   }
>>   
>> +int git_config_get_virtualfilesystem(void)
>> +{
>> +	if (git_config_get_pathname("core.virtualfilesystem", &core_virtualfilesystem))
>> +		core_virtualfilesystem = getenv("GIT_TEST_VIRTUALFILESYSTEM");
>> +
>> +	if (core_virtualfilesystem && !*core_virtualfilesystem)
>> +		core_virtualfilesystem = NULL;
>> +
>> +	if (core_virtualfilesystem) {
>> +		/*
>> +		 * Some git commands spawn helpers and redirect the index to a different
>> +		 * location.  These include "difftool -d" and the sequencer
>> +		 * (i.e. `git rebase -i`, `git cherry-pick` and `git revert`) and others.
>> +		 * In those instances we don't want to update their temporary index with
>> +		 * our virtualization data.
>> +		 */
>> +		char *default_index_file = xstrfmt("%s/%s", the_repository->gitdir, "index");
>> +		int should_run_hook = !strcmp(default_index_file, the_repository->index_file);
>> +
>> +		free(default_index_file);
>> +		if (should_run_hook) {
>> +			/* virtual file system relies on the sparse checkout logic so force it on */
>> +			core_apply_sparse_checkout = 1;
>> +			return 1;
>> +		}
>> +		core_virtualfilesystem = NULL;
> 
> It would be a small leak if this came from config_get_pathname(),
> but if it came from $GIT_TEST_VFS env, we cannot free it X-<.
> 
> A helper function called *_get_X() that does not return X as its
> return value or updating the location pointed by its *dst parameter,
> and instead only stores its finding in a global variable feels
> somewhat odd.  It smells more like "find out", "probe", "check",
> etc.
> 

I agree.  Frankly, I think it should just return whether it should be 
used or not (bool) and the hook name should be fixed.  I got push back 
when I did that for fsmonitor so I made this the same in an effort to 
head off that same feedback.

>> diff --git a/dir.c b/dir.c
>> index 47c2fca8dc..3097b0e446 100644
>> --- a/dir.c
>> +++ b/dir.c
>> @@ -21,6 +21,7 @@
>>   #include "ewah/ewok.h"
>>   #include "fsmonitor.h"
>>   #include "submodule-config.h"
>> +#include "virtualfilesystem.h"
>>   
>>   /*
>>    * Tells read_directory_recursive how a file or directory should be treated.
>> @@ -1109,6 +1110,18 @@ int is_excluded_from_list(const char *pathname,
>>   			  struct exclude_list *el, struct index_state *istate)
>>   {
>>   	struct exclude *exclude;
>> +
>> +	/*
>> +	 * The virtual file system data is used to prevent git from traversing
>> +	 * any part of the tree that is not in the virtual file system.  Return
>> +	 * 1 to exclude the entry if it is not found in the virtual file system,
>> +	 * else fall through to the regular excludes logic as it may further exclude.
>> +	 */
> 
> This comment will sit better immediately in front of the call to "is
> excluded from vfs?" helper function.
> 
>> +	if (*dtype == DT_UNKNOWN)
>> +		*dtype = get_dtype(NULL, istate, pathname, pathlen);
> 
> We try to defer paying cost to determine unknown *dtype as late as
> possible by having this call in last_exclude_matching_from_list(),
> and not here.  If we are doing this, we probably should update the
> callpaths that call last_exclude_matching_from_list() to make the
> caller responsible for doing get_dtype() and drop the lazy finding
> of dtype from the callee.  Alternatively, the new "is excluded from
> vfs" helper can learn to do the lazy get_dtype() just like the
> existing last_exclude_matching_from_list() does.  I suspect the
> latter may be simpler.

In is_excluded_from_virtualfilesystem() dtype can't be lazy because it 
is always needed (which is why I test and die if it isn't known).  I 
considered doing the test/call to get_dtype() within 
is_excluded_from_virtualfilesystem() but didn't like making it dependent 
on istate just so I could move the get_dtype() call in one level.  It is 
functionally identical so I can easily move it in if that is preferred.

> 
>> +	if (is_excluded_from_virtualfilesystem(pathname, pathlen, *dtype) > 0)
>> +		return 1;
>> +
>>   	exclude = last_exclude_matching_from_list(pathname, pathlen, basename,
>>   						  dtype, el, istate);
>>   	if (exclude)
>> @@ -1324,8 +1337,20 @@ struct exclude *last_exclude_matching(struct dir_struct *dir,
>>   int is_excluded(struct dir_struct *dir, struct index_state *istate,
>>   		const char *pathname, int *dtype_p)
>>   {
>> -	struct exclude *exclude =
>> -		last_exclude_matching(dir, istate, pathname, dtype_p);
>> +	struct exclude *exclude;
>> +
>> +	/*
>> +	 * The virtual file system data is used to prevent git from traversing
>> +	 * any part of the tree that is not in the virtual file system.  Return
>> +	 * 1 to exclude the entry if it is not found in the virtual file system,
>> +	 * else fall through to the regular excludes logic as it may further exclude.
>> +	 */
>> +	if (*dtype_p == DT_UNKNOWN)
>> +		*dtype_p = get_dtype(NULL, istate, pathname, strlen(pathname));
> 
> Exactly the same comment as above.
> 
>> +	if (is_excluded_from_virtualfilesystem(pathname, strlen(pathname), *dtype_p) > 0)
>> +		return 1;
>> +
>> +	exclude = last_exclude_matching(dir, istate, pathname, dtype_p);
>>   	if (exclude)
>>   		return exclude->flags & EXC_FLAG_NEGATIVE ? 0 : 1;
>>   	return 0;
>> @@ -1678,6 +1703,9 @@ static enum path_treatment treat_one_path(struct dir_struct *dir,
>>   	if (dtype != DT_DIR && has_path_in_index)
>>   		return path_none;
>>   
>> +	if (is_excluded_from_virtualfilesystem(path->buf, path->len, dtype) > 0)
>> +		return path_excluded;
>> +
>>   	/*
>>   	 * When we are looking at a directory P in the working tree,
>>   	 * there are three cases:
>> @@ -2018,6 +2046,8 @@ static enum path_treatment read_directory_recursive(struct dir_struct *dir,
>>   		/* add the path to the appropriate result list */
>>   		switch (state) {
>>   		case path_excluded:
>> +			if (is_excluded_from_virtualfilesystem(path.buf, path.len, DT_DIR) > 0)
>> +				break;
>>   			if (dir->flags & DIR_SHOW_IGNORED)
>>   				dir_add_name(dir, istate, path.buf, path.len);
>>   			else if ((dir->flags & DIR_SHOW_IGNORED_TOO) ||
> 
> I am kind-of surprised that the "damage" to dir.c need to support
> this is so isolated and small ;-)
> 

When I'm expecting to carry a patch in a fork I try to structure it to 
have as small of a footprint as possible (hence moving most of the code 
into a separate file).  Keeps the merge conflicts to a minimum. ;-)

>> ...
>> +if test "$1" != 1
>> +then
>> +	echo "Unsupported core.virtualfilesystem hook version." >&2
>> +	exit 1
>> +fi
>> +
>> +#find -type f -printf '%P\0'
>> +find -type d -printf '%P/\0'
> 
> I am not reading (hence not commenting on) tests in this review
> message yet.
> 
>> diff --git a/unpack-trees.c b/unpack-trees.c
>> index 7570df481b..ee3cda2e94 100644
>> --- a/unpack-trees.c
>> +++ b/unpack-trees.c
>> @@ -18,6 +18,7 @@
>>   #include "fsmonitor.h"
>>   #include "object-store.h"
>>   #include "fetch-object.h"
>> +#include "virtualfilesystem.h"
>>   
>>   /*
>>    * Error messages expected by scripts out of plumbing commands such as
>> @@ -1363,6 +1364,14 @@ static int clear_ce_flags_1(struct index_state *istate,
>>   			continue;
>>   		}
>>   
>> +		/* if it's not in the virtual file system, exit early */
>> +		if (core_virtualfilesystem) {
>> +			if (is_included_in_virtualfilesystem(ce->name, ce->ce_namelen) > 0)
>> +				ce->ce_flags &= ~clear_mask;
>> +			cache++;
>> +			continue;
>> +		}
> 
> Earlier we saw "is it excluded?" and now we have "is it included?"
> They have different function signature (i.e. "included?" does not
> need to know the type of the entry), and I am guessing that for the
> purpose of this particular patch that may be sufficient, but I have
> to wonder if in the longer term we'd be better off to keep the
> interface to these two functions similar.  Also, I wonder if we need
> both---I see below that these "is included/excluded?" helpers are
> allowed to say "I dunno", so that may be the reason why we cannot
> simply say "included is the same as !excluded".
> 

Most of the logic _is_ just inverted (which is why is_excluded...() can 
call is_included...() to do most of the work) but there are differences 
as well.  The differences are mostly about how parent directories are 
handled.  I tried to write it to share as much of the logic as possible 
but there may be something I've missed.


Your comments are all feedback on the code - how it was implemented, 
style, etc.  Any thoughts on whether this is something we could/should 
merge into master (after any necessary cleanup)?  Would anyone else find 
this interesting/helpful?
Ben Peart Oct. 31, 2018, 8:53 p.m. UTC | #4
On 10/31/2018 3:11 PM, Duy Nguyen wrote:
> not really a review, just  a couple quick notes..
> 

Perfect!  As an RFC, I'm more looking for high level thoughts/notes than 
a style/syntax code review.

> On Tue, Oct 30, 2018 at 9:40 PM Ben Peart <peartben@gmail.com> wrote:
>>
>> From: Ben Peart <benpeart@microsoft.com>
>>
>> On index load, clear/set the skip worktree bits based on the virtual
>> file system data. Use virtual file system data to update skip-worktree
>> bit in unpack-trees. Use virtual file system data to exclude files and
>> folders not explicitly requested.
>>
>> Signed-off-by: Ben Peart <benpeart@microsoft.com>
>> ---
>>
>> We have taken several steps to make git perform well on very large repos.
>> Some of those steps include: improving underlying algorithms, utilizing
>> multi-threading where possible, and simplifying the behavior of some commands.
>> These changes typically benefit all git repos to varying degrees.  While
>> these optimizations all help, they are insufficient to provide adequate
>> performance on the very large repos we often work with.
>>
>> To make git perform well on the very largest repos, we had to make more
>> significant changes.  The biggest performance win by far is the work we have
>> done to make git operations O(modified) instead of O(size of repo).  This
>> takes advantage of the fact that the number of files a developer has modified
>> is a tiny fraction of the overall repo size.
>>
>> We accomplished this by utilizing the existing internal logic for the skip
>> worktree bit and excludes to tell git to ignore all files and folders other
>> than those that have been modified.  This logic is driven by an external
>> process that monitors writes to the repo and communicates the list of files
>> and folders with changes to git via the virtual file system hook in this patch.
>>
>> The external process maintains a list of files and folders that have been
>> modified.  When git runs, it requests the list of files and folders that
>> have been modified via the virtual file system hook.  Git then sets/clears
>> the skip-worktree bit on the cache entries and builds a hashmap of the
>> modified files/folders that is used by the excludes logic to avoid scanning
>> the entire repo looking for changes and untracked files.
>>
>> With this system, we have been able to make local git command performance on
>> extremely large repos (millions of files, 1/2 million folders) entirely
>> manageable (30 second checkout, 3.5 seconds status, 4 second add, 7 second
>> commit, etc).
>>
>> Our desire is to eliminate all custom patches in our fork of git.  To that
>> end, I'm submitting this as an RFC to see how much interest there is and how
>> much willingness to take this type of change into git.
> 
> Most of these paragraphs (perhaps except the last one) should be part
> of the commit message. You describe briefly what the patch does but
> it's even more important to say why you want to do it.
> 
>> +core.virtualFilesystem::
>> +       If set, the value of this variable is used as a command which
>> +       will identify all files and directories that are present in
>> +       the working directory.  Git will only track and update files
>> +       listed in the virtual file system.  Using the virtual file system
>> +       will supersede the sparse-checkout settings which will be ignored.
>> +       See the "virtual file system" section of linkgit:githooks[6].
> 
> It sounds like "virtual file system" is just one of the use cases for
> this feature, which is more about a dynamic source of sparse-checkout
> bits. Perhaps name the config key with something along sparse checkout
> instead of naming it after a use case.

It's more than a dynamic sparse-checkout because the same list is also 
used to exclude any file/folder not listed.  That means any file not 
listed won't ever be updated by git (like in 'checkout' for example) so 
'stale' files could be left in the working directory.  It also means git 
won't find new/untracked files unless they are specifically added to the 
list.

> 
> This is a hook. I notice we start to avoid adding real hooks and just
> add config keys instead. Eventually we should have config-based hooks,
> but if we're going to add more like this, I think these should be in a
> separate section, hook.virtualFileSystem or something.
> 

That is a great idea.  I don't personally like specifying the hook as 
the 'flag' for whether a feature should be used.  I'd rather have it be 
a bool (enable the feature? true/false) and 1) either have the hook name 
hard coded (like most existing hooks) or 2) as you suggest add a 
consistent way to have config-based hooks.  Config based hooks could 
also help provide a consistent way to configure them using GIT_TEST_* 
environment variables for testing.

> I don't think the superseding makes sense. There's no reason this
> could not be used in combination with $GIT_DIR/info/sparse-checkout.
> If you don't want both, disable the other.
> 
> One last note. Since this is related to filesystem. Shouldn't it be
> part of fsmonitor (the protocol, not the implementation)? Then
> watchman user could use it to.
> 

To get this to work properly takes a lot more logic than exists in 
fsmonitor/watchman.  The challenge is that 1) fsmonitor/watchman is 
focused on "what has changed since <time>" and 2) doesn't currently 
impact the excludes logic.

If you attempted to use this with watchman there is a chicken and egg 
problem.  The initial git checkout wouldn't write out _any_ files to the 
working directory as none have been modified.  There would be no way to 
get them populated where they could even get modified to get added to 
the list.  Not very useful. :-)

This works with VFS for Git because it provides a virtual projection and 
will dynamically write out the contents of the file in the working 
directory as they are read.  It makes it appear that they are there and 
will fetch the actual contents on demand transparently.  If the user 
ends up modifying the file, it will get added to the virtual projection 
list so that git will start to pay attention to and update that file.

If the files are only read (and not written) by the user, the version on 
disk must be maintained by the VFS for Git daemon because git is 
completely unaware of them.  That means the daemon must detect when the 
git commit changes and remove the contents of all the files that were 
read but not written and start projecting the files from new commit.

In short, this is only one small piece of what is necessary to get a 
fully virtualized git repo.  It's an important piece but only one of the 
many pieces.  My reason for submitting this RFC is to start the 
discussion about how interested the community is in enabling repo 
virtualization in the mainline version of git.
Duy Nguyen Nov. 4, 2018, 6:34 a.m. UTC | #5
On Wed, Oct 31, 2018 at 9:53 PM Ben Peart <peartben@gmail.com> wrote:
> >> +core.virtualFilesystem::
> >> +       If set, the value of this variable is used as a command which
> >> +       will identify all files and directories that are present in
> >> +       the working directory.  Git will only track and update files
> >> +       listed in the virtual file system.  Using the virtual file system
> >> +       will supersede the sparse-checkout settings which will be ignored.
> >> +       See the "virtual file system" section of linkgit:githooks[6].
> >
> > It sounds like "virtual file system" is just one of the use cases for
> > this feature, which is more about a dynamic source of sparse-checkout
> > bits. Perhaps name the config key with something along sparse checkout
> > instead of naming it after a use case.
>
> It's more than a dynamic sparse-checkout because the same list is also
> used to exclude any file/folder not listed.  That means any file not
> listed won't ever be updated by git (like in 'checkout' for example) so
> 'stale' files could be left in the working directory.  It also means git
> won't find new/untracked files unless they are specifically added to the
> list.

OK. I'm not at all interested in carrying maintenance burden for some
software behind closed doors. I could see values in having a more
flexible sparse checkout but this now seems like very tightly designed
for GVFS. So unless there's another use case (preferably open source)
for this, I don't think this should be added in git.git.
brian m. carlson Nov. 4, 2018, 9:01 p.m. UTC | #6
On Sun, Nov 04, 2018 at 07:34:01AM +0100, Duy Nguyen wrote:
> On Wed, Oct 31, 2018 at 9:53 PM Ben Peart <peartben@gmail.com> wrote:
> > It's more than a dynamic sparse-checkout because the same list is also
> > used to exclude any file/folder not listed.  That means any file not
> > listed won't ever be updated by git (like in 'checkout' for example) so
> > 'stale' files could be left in the working directory.  It also means git
> > won't find new/untracked files unless they are specifically added to the
> > list.
> 
> OK. I'm not at all interested in carrying maintenance burden for some
> software behind closed doors. I could see values in having a more
> flexible sparse checkout but this now seems like very tightly designed
> for GVFS. So unless there's another use case (preferably open source)
> for this, I don't think this should be added in git.git.

I should point out that VFS for Git is an open-source project and will
likely have larger use than just at Microsoft.  There are both Windows
and Mac clients and there are plans for a Linux client as well.
Ideally, it would work with an unmodified upstream Git, which is (I
assume) why Ben is sending this series.

Personally, I don't love the current name used in this series.  I don't
see this patch as introducing a virtual file system in the Unix sense of
that word, and I think calling it that in Git core will be confusing to
Unix users.  I would prefer to see it as a hook (maybe called
"sparse-checkout" or "sparse-exclude"; better names are okay), and
simply turn it on based on whether or not there's an appropriate hook
file there and whether core.sparseCheckout is on (or possibly with
hook.sparseExclude or something).  With a design more like that, I don't
see a problem with it in principle.
Junio C Hamano Nov. 5, 2018, 12:02 a.m. UTC | #7
Ben Peart <peartben@gmail.com> writes:

>>> +	if (*dtype == DT_UNKNOWN)
>>> +		*dtype = get_dtype(NULL, istate, pathname, pathlen);
>>
>> We try to defer paying cost to determine unknown *dtype as late as
>> possible by having this call in last_exclude_matching_from_list(),
>> and not here.  If we are doing this, we probably should update the
>> callpaths that call last_exclude_matching_from_list() to make the
>> caller responsible for doing get_dtype() and drop the lazy finding
>> of dtype from the callee.  Alternatively, the new "is excluded from
>> vfs" helper can learn to do the lazy get_dtype() just like the
>> existing last_exclude_matching_from_list() does.  I suspect the
>> latter may be simpler.
>
> In is_excluded_from_virtualfilesystem() dtype can't be lazy because it
> is always needed (which is why I test and die if it isn't known).  

You make a call to that function even when virtual-file-system hook
is not in use, i.e. instead of the caller saying

	if (is_vfs_in_use()) {
		*dtype = get_dtype(...);
                if (is_excluded_from_vfs(...) > 0)
			return 1;
	}

your caller makes an unconditional call to is_excluded_from_vfs().
Isn't that the only reason why you break the laziness of determining
dtype?

You can keep the caller simple by making an unconditional call, but
maintain the laziness by updating the callig convention to pass
dtype (not *dtype) to the function, e.g..

	if (is_excluded_from_vfs(pathname, pathlen, dtype) > 0)
		return 1;

and then at the beginning of the helper

	if (is_vfs_in_use())
		return -1; /* undetermined */
	*dtype = get_dtype(...);
	... whatever logic it has now ...

no?

> Your comments are all feedback on the code - how it was implemented,
> style, etc.  Any thoughts on whether this is something we could/should
> merge into master (after any necessary cleanup)?  Would anyone else
> find this interesting/helpful?

I am pretty much neutral.  Not strongly opposed to it, but not all
that interested until seeing its integration with the "userland" to
see how the whole thing works ;-)
Ævar Arnfjörð Bjarmason Nov. 5, 2018, 11:40 a.m. UTC | #8
On Sun, Nov 04 2018, Duy Nguyen wrote:

> On Wed, Oct 31, 2018 at 9:53 PM Ben Peart <peartben@gmail.com> wrote:
>> >> +core.virtualFilesystem::
>> >> +       If set, the value of this variable is used as a command which
>> >> +       will identify all files and directories that are present in
>> >> +       the working directory.  Git will only track and update files
>> >> +       listed in the virtual file system.  Using the virtual file system
>> >> +       will supersede the sparse-checkout settings which will be ignored.
>> >> +       See the "virtual file system" section of linkgit:githooks[6].
>> >
>> > It sounds like "virtual file system" is just one of the use cases for
>> > this feature, which is more about a dynamic source of sparse-checkout
>> > bits. Perhaps name the config key with something along sparse checkout
>> > instead of naming it after a use case.
>>
>> It's more than a dynamic sparse-checkout because the same list is also
>> used to exclude any file/folder not listed.  That means any file not
>> listed won't ever be updated by git (like in 'checkout' for example) so
>> 'stale' files could be left in the working directory.  It also means git
>> won't find new/untracked files unless they are specifically added to the
>> list.
>
> OK. I'm not at all interested in carrying maintenance burden for some
> software behind closed doors. I could see values in having a more
> flexible sparse checkout but this now seems like very tightly designed
> for GVFS. So unless there's another use case (preferably open source)
>  for this, I don't think this should be added in git.git.

I haven't looked at the patch in any detail beyond skimming it, and
you're more familiar with this area...

But in principle I'm very interested in getting something like this in
git.git, even if we were to assume GVFS was a 100% proprietary
implementation (which it's not, it's open source[1] and there's work on
a non-Windows port[2]).

By some definition we already support at least one "virtual filesystem",
or two. I.e. we check files into a *.git repository where everything's
trees & blobs, and then we need to ferry it back & forth between that
and a POSIX-like fs using readdir(), stat(), open() and the like. Of
course this isn't "virtual" in the fs sense, but it's a foreign virtual
FS as far as git is concerned.

Our other "virtual filesystem" (although this is a stretch) is what we
expose via the plumbing commands.

Both of these, as covered at length in various GVFS design docs /
background info, aren't a good fit for some implementation where you'd
like a file-like view that works with the index/status etc. but doesn't
involve checking out everything. There's a huge gap to bridge between
our plumbing & the features we expose that require a POSIX-like fs.

I can see such an ability being very useful for things that aren't the
massive repo Microsoft is interested in supporting. E.g. even for a
~500MB repo it's a stretch to clone that, edit it, and submit a PR on
something like an iOS/Android device.

Once we have something in git to support the likes of GVFS supporting
use-cases like that also becomes easier. Between shallow cloning, the
various object filters & this it would be great to be able to get a 1MB
clone of git.git on my phone and submit a patch to it.

The only potential downside I see is that there's currently exactly one
implementation of this sort of thing in the wild, so we risk any such
API becoming too tied up with just what GVFS wants, and not what we'd
like to support with such a thing in general. This is what e.g. the w3c
tries to avoid with having multiple browser implementations before
something is standardized.

But I think that's easy to work around. E.g. we can document any such
API saying that it's an experimental v1 and is on probation until we get
more users of the feature and figure out the warts & limitations. If it
does end up being too GVFS-specific we could make a breaking change at
some point and eventually drop the v1 version.

1. https://gvfs.io/
2. https://arstechnica.com/gadgets/2017/11/microsoft-and-github-team-up-to-take-git-virtual-file-system-to-macos-linux/
Duy Nguyen Nov. 5, 2018, 3:22 p.m. UTC | #9
On Sun, Nov 4, 2018 at 10:01 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> On Sun, Nov 04, 2018 at 07:34:01AM +0100, Duy Nguyen wrote:
> > On Wed, Oct 31, 2018 at 9:53 PM Ben Peart <peartben@gmail.com> wrote:
> > > It's more than a dynamic sparse-checkout because the same list is also
> > > used to exclude any file/folder not listed.  That means any file not
> > > listed won't ever be updated by git (like in 'checkout' for example) so
> > > 'stale' files could be left in the working directory.  It also means git
> > > won't find new/untracked files unless they are specifically added to the
> > > list.
> >
> > OK. I'm not at all interested in carrying maintenance burden for some
> > software behind closed doors. I could see values in having a more
> > flexible sparse checkout but this now seems like very tightly designed
> > for GVFS. So unless there's another use case (preferably open source)
> > for this, I don't think this should be added in git.git.
>
> I should point out that VFS for Git is an open-source project and will
> likely have larger use than just at Microsoft.  There are both Windows
> and Mac clients and there are plans for a Linux client as well.
> Ideally, it would work with an unmodified upstream Git, which is (I
> assume) why Ben is sending this series.

Ah I didn't know that. Thank you. I'll have to look at this GVFS some time then.

If we're going to support GVFS though, I think there should be a big
(RFC perhaps) series that includes everything to at least give an
overview what the end game looks like. Then it could be split up into
smaller series.
Duy Nguyen Nov. 5, 2018, 3:26 p.m. UTC | #10
On Mon, Nov 5, 2018 at 12:40 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Sun, Nov 04 2018, Duy Nguyen wrote:
>
> > On Wed, Oct 31, 2018 at 9:53 PM Ben Peart <peartben@gmail.com> wrote:
> >> >> +core.virtualFilesystem::
> >> >> +       If set, the value of this variable is used as a command which
> >> >> +       will identify all files and directories that are present in
> >> >> +       the working directory.  Git will only track and update files
> >> >> +       listed in the virtual file system.  Using the virtual file system
> >> >> +       will supersede the sparse-checkout settings which will be ignored.
> >> >> +       See the "virtual file system" section of linkgit:githooks[6].
> >> >
> >> > It sounds like "virtual file system" is just one of the use cases for
> >> > this feature, which is more about a dynamic source of sparse-checkout
> >> > bits. Perhaps name the config key with something along sparse checkout
> >> > instead of naming it after a use case.
> >>
> >> It's more than a dynamic sparse-checkout because the same list is also
> >> used to exclude any file/folder not listed.  That means any file not
> >> listed won't ever be updated by git (like in 'checkout' for example) so
> >> 'stale' files could be left in the working directory.  It also means git
> >> won't find new/untracked files unless they are specifically added to the
> >> list.
> >
> > OK. I'm not at all interested in carrying maintenance burden for some
> > software behind closed doors. I could see values in having a more
> > flexible sparse checkout but this now seems like very tightly designed
> > for GVFS. So unless there's another use case (preferably open source)
> >  for this, I don't think this should be added in git.git.
>
> I haven't looked at the patch in any detail beyond skimming it, and
> you're more familiar with this area...
>
> But in principle I'm very interested in getting something like this in
> git.git, even if we were to assume GVFS was a 100% proprietary
> implementation.

I have nothing against building a GVFS-like solution. If what's
submitted can be the building blocks for that, great! But if it was
just for GVFS (and it was not available to everybody) then no thank
you.
Ben Peart Nov. 5, 2018, 8 p.m. UTC | #11
On 11/4/2018 7:02 PM, Junio C Hamano wrote:
> Ben Peart <peartben@gmail.com> writes:
> 
>>>> +	if (*dtype == DT_UNKNOWN)
>>>> +		*dtype = get_dtype(NULL, istate, pathname, pathlen);
>>>
>>> We try to defer paying cost to determine unknown *dtype as late as
>>> possible by having this call in last_exclude_matching_from_list(),
>>> and not here.  If we are doing this, we probably should update the
>>> callpaths that call last_exclude_matching_from_list() to make the
>>> caller responsible for doing get_dtype() and drop the lazy finding
>>> of dtype from the callee.  Alternatively, the new "is excluded from
>>> vfs" helper can learn to do the lazy get_dtype() just like the
>>> existing last_exclude_matching_from_list() does.  I suspect the
>>> latter may be simpler.
>>
>> In is_excluded_from_virtualfilesystem() dtype can't be lazy because it
>> is always needed (which is why I test and die if it isn't known).
> 
> You make a call to that function even when virtual-file-system hook
> is not in use, i.e. instead of the caller saying
> 
> 	if (is_vfs_in_use()) {
> 		*dtype = get_dtype(...);
>                  if (is_excluded_from_vfs(...) > 0)
> 			return 1;
> 	}
> 
> your caller makes an unconditional call to is_excluded_from_vfs().
> Isn't that the only reason why you break the laziness of determining
> dtype?
> 
> You can keep the caller simple by making an unconditional call, but
> maintain the laziness by updating the callig convention to pass
> dtype (not *dtype) to the function, e.g..
> 
> 	if (is_excluded_from_vfs(pathname, pathlen, dtype) > 0)
> 		return 1;
> 
> and then at the beginning of the helper
> 
> 	if (is_vfs_in_use())
> 		return -1; /* undetermined */
> 	*dtype = get_dtype(...);
> 	... whatever logic it has now ...
> 
> no?
> 

Oops!  You're right, I docall get_dtype() even if the vfs isn't in use. 
I'll add an additional test to avoid doing that.  Thank you.

I did look into moving the delay load logic into is_excluded_from_vfs() 
but get_dtype() is static to dir.c and I'd prefer to keep it that way if 
possible.

>> Your comments are all feedback on the code - how it was implemented,
>> style, etc.  Any thoughts on whether this is something we could/should
>> merge into master (after any necessary cleanup)?  Would anyone else
>> find this interesting/helpful?
> 
> I am pretty much neutral.  Not strongly opposed to it, but not all
> that interested until seeing its integration with the "userland" to
> see how the whole thing works ;-)
>
Ben Peart Nov. 5, 2018, 8:07 p.m. UTC | #12
On 11/5/2018 10:26 AM, Duy Nguyen wrote:
> On Mon, Nov 5, 2018 at 12:40 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>
>> On Sun, Nov 04 2018, Duy Nguyen wrote:
>>
>>> On Wed, Oct 31, 2018 at 9:53 PM Ben Peart <peartben@gmail.com> wrote:
>>>>>> +core.virtualFilesystem::
>>>>>> +       If set, the value of this variable is used as a command which
>>>>>> +       will identify all files and directories that are present in
>>>>>> +       the working directory.  Git will only track and update files
>>>>>> +       listed in the virtual file system.  Using the virtual file system
>>>>>> +       will supersede the sparse-checkout settings which will be ignored.
>>>>>> +       See the "virtual file system" section of linkgit:githooks[6].
>>>>>
>>>>> It sounds like "virtual file system" is just one of the use cases for
>>>>> this feature, which is more about a dynamic source of sparse-checkout
>>>>> bits. Perhaps name the config key with something along sparse checkout
>>>>> instead of naming it after a use case.
>>>>
>>>> It's more than a dynamic sparse-checkout because the same list is also
>>>> used to exclude any file/folder not listed.  That means any file not
>>>> listed won't ever be updated by git (like in 'checkout' for example) so
>>>> 'stale' files could be left in the working directory.  It also means git
>>>> won't find new/untracked files unless they are specifically added to the
>>>> list.
>>>
>>> OK. I'm not at all interested in carrying maintenance burden for some
>>> software behind closed doors. I could see values in having a more
>>> flexible sparse checkout but this now seems like very tightly designed
>>> for GVFS. So unless there's another use case (preferably open source)
>>>   for this, I don't think this should be added in git.git.
>>
>> I haven't looked at the patch in any detail beyond skimming it, and
>> you're more familiar with this area...
>>
>> But in principle I'm very interested in getting something like this in
>> git.git, even if we were to assume GVFS was a 100% proprietary
>> implementation.
> 
> I have nothing against building a GVFS-like solution. If what's
> submitted can be the building blocks for that, great! But if it was
> just for GVFS (and it was not available to everybody) then no thank
> you.
> 

Not only is VFS for Git open source and is/will be supported on Windows, 
Mac and Linux, the interface being proposed is quite generic so should 
be usable for other implementations.

To use it, you just need to provide a hook that will return a list of 
files git should pay attention to (using a subset of the existing 
sparse-checkout format).

If you see anything that would make using it difficult for other 
solutions to use, let's fix it now!
Ben Peart Nov. 5, 2018, 8:18 p.m. UTC | #13
On 11/5/2018 10:22 AM, Duy Nguyen wrote:
> On Sun, Nov 4, 2018 at 10:01 PM brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
>>
>> On Sun, Nov 04, 2018 at 07:34:01AM +0100, Duy Nguyen wrote:
>>> On Wed, Oct 31, 2018 at 9:53 PM Ben Peart <peartben@gmail.com> wrote:
>>>> It's more than a dynamic sparse-checkout because the same list is also
>>>> used to exclude any file/folder not listed.  That means any file not
>>>> listed won't ever be updated by git (like in 'checkout' for example) so
>>>> 'stale' files could be left in the working directory.  It also means git
>>>> won't find new/untracked files unless they are specifically added to the
>>>> list.
>>>
>>> OK. I'm not at all interested in carrying maintenance burden for some
>>> software behind closed doors. I could see values in having a more
>>> flexible sparse checkout but this now seems like very tightly designed
>>> for GVFS. So unless there's another use case (preferably open source)
>>> for this, I don't think this should be added in git.git.
>>
>> I should point out that VFS for Git is an open-source project and will
>> likely have larger use than just at Microsoft.  There are both Windows
>> and Mac clients and there are plans for a Linux client as well.
>> Ideally, it would work with an unmodified upstream Git, which is (I
>> assume) why Ben is sending this series.
> 
> Ah I didn't know that. Thank you. I'll have to look at this GVFS some time then.
> 
> If we're going to support GVFS though, I think there should be a big
> (RFC perhaps) series that includes everything to at least give an
> overview what the end game looks like. Then it could be split up into
> smaller series.
> 

We've always had the goal of not needing a fork at all and are 
continually working to bring the list of differences to zero but in the 
mean time, you can see the entire set of changes we've made here [1].

If you look, most of them are changes we are already in process of 
submitting to git (ie midx, tracing, etc) or patches we fast tracked 
from master to our branch (your unpack_trees() optimizations for example).

Most of the others are small tweaks and features for performance or to 
smooth integration issues.  This RFC contains the core changes that were 
required to enable VFS for Git.

[1] 
https://github.com/git-for-windows/git/compare/master...Microsoft:gvfs-2.19.1
Ben Peart Nov. 5, 2018, 8:27 p.m. UTC | #14
On 11/4/2018 4:01 PM, brian m. carlson wrote:
> On Sun, Nov 04, 2018 at 07:34:01AM +0100, Duy Nguyen wrote:
>> On Wed, Oct 31, 2018 at 9:53 PM Ben Peart <peartben@gmail.com> wrote:
>>> It's more than a dynamic sparse-checkout because the same list is also
>>> used to exclude any file/folder not listed.  That means any file not
>>> listed won't ever be updated by git (like in 'checkout' for example) so
>>> 'stale' files could be left in the working directory.  It also means git
>>> won't find new/untracked files unless they are specifically added to the
>>> list.
>>
>> OK. I'm not at all interested in carrying maintenance burden for some
>> software behind closed doors. I could see values in having a more
>> flexible sparse checkout but this now seems like very tightly designed
>> for GVFS. So unless there's another use case (preferably open source)
>> for this, I don't think this should be added in git.git.
> 
> I should point out that VFS for Git is an open-source project and will
> likely have larger use than just at Microsoft.  There are both Windows
> and Mac clients and there are plans for a Linux client as well.
> Ideally, it would work with an unmodified upstream Git, which is (I
> assume) why Ben is sending this series.
> 
> Personally, I don't love the current name used in this series.  I don't
> see this patch as introducing a virtual file system in the Unix sense of
> that word, and I think calling it that in Git core will be confusing to
> Unix users.  I would prefer to see it as a hook (maybe called
> "sparse-checkout" or "sparse-exclude"; better names are okay), and
> simply turn it on based on whether or not there's an appropriate hook
> file there and whether core.sparseCheckout is on (or possibly with
> hook.sparseExclude or something).  With a design more like that, I don't
> see a problem with it in principle.
> 

I'm really bad at naming so am happy to choose something else that will 
be more descriptive to the community at large.  The name came from the 
fact that we started with the (equally awful) 'VFS for Git' and this was 
the big enabling feature in git so for better or worse it got saddled 
with the same 'VFS' name.

In other feedback it was suggested to not add a core.vfs setting that 
was the path to the hook and I like that.  I can change it to 
core.sparseExclude (unless someone has something better) and hard code 
the hook name for now.  I do like the idea of having config based hooks 
so when I get some time I will put together a patch series to implement 
that.
Johannes Schindelin Nov. 5, 2018, 9:53 p.m. UTC | #15
Hi Ævar,

On Mon, 5 Nov 2018, Ævar Arnfjörð Bjarmason wrote:

> The only potential downside I see is that there's currently exactly one
> implementation of this sort of thing in the wild, so we risk any such
> API becoming too tied up with just what GVFS wants, and not what we'd
> like to support with such a thing in general. This is what e.g. the w3c
> tries to avoid with having multiple browser implementations before
> something is standardized.

It is my understanding that Ben is quite interested in ideas how to make
this *not* tied up with VFSforGit.

I'd think that he would welcome any good ideas you have in that direction.

Ciao,
Dscho

Patch
diff mbox series

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 09e95e9e98..dd4b834375 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -441,6 +441,14 @@  core.fsmonitor::
 	avoiding unnecessary processing of files that have not changed.
 	See the "fsmonitor-watchman" section of linkgit:githooks[5].
 
+core.virtualFilesystem::
+	If set, the value of this variable is used as a command which
+	will identify all files and directories that are present in
+	the working directory.  Git will only track and update files
+	listed in the virtual file system.  Using the virtual file system
+	will supersede the sparse-checkout settings which will be ignored.
+	See the "virtual file system" section of linkgit:githooks[6].
+
 core.trustctime::
 	If false, the ctime differences between the index and the
 	working tree are ignored; useful when the inode change time
diff --git a/Documentation/githooks.txt b/Documentation/githooks.txt
index 959044347e..87f9ad2a77 100644
--- a/Documentation/githooks.txt
+++ b/Documentation/githooks.txt
@@ -492,6 +492,26 @@  This hook is invoked by `git-p4 submit`. It takes no parameters and nothing
 from standard input. Exiting with non-zero status from this script prevent
 `git-p4 submit` from launching. Run `git-p4 submit --help` for details.
 
+virtualFilesystem
+~~~~~~~~~~~~~~~~~~
+
+"Virtual File System" allows populating the working directory sparsely.
+The projection data is typically automatically generated by an external
+process.  Git will limit what files it checks for changes as well as which
+directories are checked for untracked files based on the path names given.
+Git will also only update those files listed in the projection.
+
+The hook is invoked when the configuration option core.virtualFilesystem
+is set.  It takes one argument, a version (currently 1).
+
+The hook should output to stdout the list of all files in the working
+directory that git should track.  The paths are relative to the root
+of the working directory and are separated by a single NUL.  Full paths
+('dir1/a.txt') as well as directories are supported (ie 'dir1/').
+
+The exit status determines whether git will use the data from the
+hook.  On error, git will abort the command with an error message.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index b08d5ea258..bcb404c28d 100644
--- a/Makefile
+++ b/Makefile
@@ -1009,6 +1009,7 @@  LIB_OBJS += utf8.o
 LIB_OBJS += varint.o
 LIB_OBJS += version.o
 LIB_OBJS += versioncmp.o
+LIB_OBJS += virtualfilesystem.o
 LIB_OBJS += walker.o
 LIB_OBJS += wildmatch.o
 LIB_OBJS += worktree.o
diff --git a/cache.h b/cache.h
index f7fabdde8f..d5815397c2 100644
--- a/cache.h
+++ b/cache.h
@@ -881,6 +881,7 @@  extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_apply_sparse_checkout;
+extern const char *core_virtualfilesystem;
 extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
diff --git a/config.c b/config.c
index 4051e38823..96e05ee0f1 100644
--- a/config.c
+++ b/config.c
@@ -1325,7 +1325,11 @@  static int git_default_core_config(const char *var, const char *value)
 	}
 
 	if (!strcmp(var, "core.sparsecheckout")) {
-		core_apply_sparse_checkout = git_config_bool(var, value);
+		/* virtual file system relies on the sparse checkout logic so force it on */
+		if (core_virtualfilesystem)
+			core_apply_sparse_checkout = 1;
+		else
+			core_apply_sparse_checkout = git_config_bool(var, value);
 		return 0;
 	}
 
@@ -2307,6 +2311,37 @@  int git_config_get_index_threads(void)
 	return 0; /* auto */
 }
 
+int git_config_get_virtualfilesystem(void)
+{
+	if (git_config_get_pathname("core.virtualfilesystem", &core_virtualfilesystem))
+		core_virtualfilesystem = getenv("GIT_TEST_VIRTUALFILESYSTEM");
+
+	if (core_virtualfilesystem && !*core_virtualfilesystem)
+		core_virtualfilesystem = NULL;
+
+	if (core_virtualfilesystem) {
+		/*
+		 * Some git commands spawn helpers and redirect the index to a different
+		 * location.  These include "difftool -d" and the sequencer
+		 * (i.e. `git rebase -i`, `git cherry-pick` and `git revert`) and others.
+		 * In those instances we don't want to update their temporary index with
+		 * our virtualization data.
+		 */
+		char *default_index_file = xstrfmt("%s/%s", the_repository->gitdir, "index");
+		int should_run_hook = !strcmp(default_index_file, the_repository->index_file);
+
+		free(default_index_file);
+		if (should_run_hook) {
+			/* virtual file system relies on the sparse checkout logic so force it on */
+			core_apply_sparse_checkout = 1;
+			return 1;
+		}
+		core_virtualfilesystem = NULL;
+	}
+
+	return 0;
+}
+
 NORETURN
 void git_die_config_linenr(const char *key, const char *filename, int linenr)
 {
diff --git a/config.h b/config.h
index a06027e69b..2ed36d5e26 100644
--- a/config.h
+++ b/config.h
@@ -251,6 +251,7 @@  extern int git_config_get_split_index(void);
 extern int git_config_get_max_percent_split_change(void);
 extern int git_config_get_fsmonitor(void);
 extern int git_config_get_index_threads(void);
+extern int git_config_get_virtualfilesystem(void);
 
 /* This dies if the configured or default date is in the future */
 extern int git_config_get_expiry(const char *key, const char **output);
diff --git a/dir.c b/dir.c
index 47c2fca8dc..3097b0e446 100644
--- a/dir.c
+++ b/dir.c
@@ -21,6 +21,7 @@ 
 #include "ewah/ewok.h"
 #include "fsmonitor.h"
 #include "submodule-config.h"
+#include "virtualfilesystem.h"
 
 /*
  * Tells read_directory_recursive how a file or directory should be treated.
@@ -1109,6 +1110,18 @@  int is_excluded_from_list(const char *pathname,
 			  struct exclude_list *el, struct index_state *istate)
 {
 	struct exclude *exclude;
+
+	/*
+	 * The virtual file system data is used to prevent git from traversing
+	 * any part of the tree that is not in the virtual file system.  Return
+	 * 1 to exclude the entry if it is not found in the virtual file system,
+	 * else fall through to the regular excludes logic as it may further exclude.
+	 */
+	if (*dtype == DT_UNKNOWN)
+		*dtype = get_dtype(NULL, istate, pathname, pathlen);
+	if (is_excluded_from_virtualfilesystem(pathname, pathlen, *dtype) > 0)
+		return 1;
+
 	exclude = last_exclude_matching_from_list(pathname, pathlen, basename,
 						  dtype, el, istate);
 	if (exclude)
@@ -1324,8 +1337,20 @@  struct exclude *last_exclude_matching(struct dir_struct *dir,
 int is_excluded(struct dir_struct *dir, struct index_state *istate,
 		const char *pathname, int *dtype_p)
 {
-	struct exclude *exclude =
-		last_exclude_matching(dir, istate, pathname, dtype_p);
+	struct exclude *exclude;
+
+	/*
+	 * The virtual file system data is used to prevent git from traversing
+	 * any part of the tree that is not in the virtual file system.  Return
+	 * 1 to exclude the entry if it is not found in the virtual file system,
+	 * else fall through to the regular excludes logic as it may further exclude.
+	 */
+	if (*dtype_p == DT_UNKNOWN)
+		*dtype_p = get_dtype(NULL, istate, pathname, strlen(pathname));
+	if (is_excluded_from_virtualfilesystem(pathname, strlen(pathname), *dtype_p) > 0)
+		return 1;
+
+	exclude = last_exclude_matching(dir, istate, pathname, dtype_p);
 	if (exclude)
 		return exclude->flags & EXC_FLAG_NEGATIVE ? 0 : 1;
 	return 0;
@@ -1678,6 +1703,9 @@  static enum path_treatment treat_one_path(struct dir_struct *dir,
 	if (dtype != DT_DIR && has_path_in_index)
 		return path_none;
 
+	if (is_excluded_from_virtualfilesystem(path->buf, path->len, dtype) > 0)
+		return path_excluded;
+
 	/*
 	 * When we are looking at a directory P in the working tree,
 	 * there are three cases:
@@ -2018,6 +2046,8 @@  static enum path_treatment read_directory_recursive(struct dir_struct *dir,
 		/* add the path to the appropriate result list */
 		switch (state) {
 		case path_excluded:
+			if (is_excluded_from_virtualfilesystem(path.buf, path.len, DT_DIR) > 0)
+				break;
 			if (dir->flags & DIR_SHOW_IGNORED)
 				dir_add_name(dir, istate, path.buf, path.len);
 			else if ((dir->flags & DIR_SHOW_IGNORED_TOO) ||
diff --git a/environment.c b/environment.c
index 3f3c8746c2..eea50b71a7 100644
--- a/environment.c
+++ b/environment.c
@@ -68,6 +68,7 @@  enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
+const char *core_virtualfilesystem;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
diff --git a/read-cache.c b/read-cache.c
index d57958233e..afe09a0343 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@ 
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "virtualfilesystem.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1890,6 +1891,7 @@  static void post_read_index_from(struct index_state *istate)
 	tweak_untracked_cache(istate);
 	tweak_split_index(istate);
 	tweak_fsmonitor(istate);
+	apply_virtualfilesystem(istate);
 }
 
 static size_t estimate_cache_size_from_compressed(unsigned int entries)
diff --git a/t/README b/t/README
index 8847489640..b79c77c8b5 100644
--- a/t/README
+++ b/t/README
@@ -343,6 +343,9 @@  of the index for the whole test suite by bypassing the default number of
 cache entries and thread minimums. Setting this to 1 will make the
 index loading single threaded.
 
+GIT_TEST_VIRTUALFILESYSTEM=$PWD/t1092/virtualfilesystem exercises the
+virtual file system code path for limiting what git pays attention to.
+
 Naming Tests
 ------------
 
diff --git a/t/t1092-virtualfilesystem.sh b/t/t1092-virtualfilesystem.sh
new file mode 100755
index 0000000000..a0f4200360
--- /dev/null
+++ b/t/t1092-virtualfilesystem.sh
@@ -0,0 +1,354 @@ 
+#!/bin/sh
+
+test_description='virtual file system tests'
+
+. ./test-lib.sh
+
+# We need total control of the virtual file system hook
+sane_unset GIT_TEST_VIRTUALFILESYSTEM
+
+clean_repo () {
+	rm .git/index &&
+	git -c core.virtualfilesystem= reset --hard HEAD &&
+	git -c core.virtualfilesystem= clean -fd &&
+	touch untracked.txt &&
+	touch dir1/untracked.txt &&
+	touch dir2/untracked.txt
+}
+
+test_expect_success 'setup' '
+	mkdir -p .git/hooks/ &&
+	cat > .gitignore <<-\EOF &&
+		.gitignore
+		expect*
+		actual*
+	EOF
+	mkdir -p dir1 &&
+	touch dir1/file1.txt &&
+	touch dir1/file2.txt &&
+	mkdir -p dir2 &&
+	touch dir2/file1.txt &&
+	touch dir2/file2.txt &&
+	git add . &&
+	git commit -m "initial" &&
+	git config --local core.virtualfilesystem .git/hooks/virtualfilesystem
+'
+
+test_expect_success 'test hook parameters and version' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		if test "$#" -ne 1
+		then
+			echo "$0: Exactly 1 argument expected" >&2
+			exit 2
+		fi
+
+		if test "$1" != 1
+		then
+			echo "$0: Unsupported hook version." >&2
+			exit 1
+		fi
+	EOF
+	git status &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		exit 3
+	EOF
+	test_must_fail git status
+'
+
+test_expect_success 'verify status is clean' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir2/file1.txt\0"
+	EOF
+	rm -f .git/index &&
+	git checkout -f &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir2/file1.txt\0"
+		printf "dir1/file1.txt\0"
+		printf "dir1/file2.txt\0"
+	EOF
+	git status > actual &&
+	cat > expected <<-\EOF &&
+		On branch master
+		nothing to commit, working tree clean
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'verify skip-worktree bit is set for absolute path' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/file1.txt\0"
+	EOF
+	git ls-files -v > actual &&
+	cat > expected <<-\EOF &&
+		H dir1/file1.txt
+		S dir1/file2.txt
+		S dir2/file1.txt
+		S dir2/file2.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'verify skip-worktree bit is cleared for absolute path' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/file2.txt\0"
+	EOF
+	git ls-files -v > actual &&
+	cat > expected <<-\EOF &&
+		S dir1/file1.txt
+		H dir1/file2.txt
+		S dir2/file1.txt
+		S dir2/file2.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'verify folder wild cards' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/\0"
+	EOF
+	git ls-files -v > actual &&
+	cat > expected <<-\EOF &&
+		H dir1/file1.txt
+		H dir1/file2.txt
+		S dir2/file1.txt
+		S dir2/file2.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'verify folders not included are ignored' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/file1.txt\0"
+		printf "dir1/file2.txt\0"
+	EOF
+	mkdir -p dir1/dir2 &&
+	touch dir1/a &&
+	touch dir1/b &&
+	touch dir1/dir2/a &&
+	touch dir1/dir2/b &&
+	git add . &&
+	git ls-files -v > actual &&
+	cat > expected <<-\EOF &&
+		H dir1/file1.txt
+		H dir1/file2.txt
+		S dir2/file1.txt
+		S dir2/file2.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'verify including one file doesnt include the rest' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/file1.txt\0"
+		printf "dir1/file2.txt\0"
+		printf "dir1/dir2/a\0"
+	EOF
+	mkdir -p dir1/dir2 &&
+	touch dir1/a &&
+	touch dir1/b &&
+	touch dir1/dir2/a &&
+	touch dir1/dir2/b &&
+	git add . &&
+	git ls-files -v > actual &&
+	cat > expected <<-\EOF &&
+		H dir1/dir2/a
+		H dir1/file1.txt
+		H dir1/file2.txt
+		S dir2/file1.txt
+		S dir2/file2.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'verify files not listed are ignored by git clean -f -x' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "untracked.txt\0"
+		printf "dir1/\0"
+	EOF
+	mkdir -p dir3 &&
+	touch dir3/untracked.txt &&
+	git clean -f -x &&
+	test ! -f untracked.txt &&
+	test -d dir1 &&
+	test -f dir1/file1.txt &&
+	test -f dir1/file2.txt &&
+	test ! -f dir1/untracked.txt &&
+	test -f dir2/file1.txt &&
+	test -f dir2/file2.txt &&
+	test -f dir2/untracked.txt &&
+	test -d dir3 &&
+	test -f dir3/untracked.txt
+'
+
+test_expect_success 'verify files not listed are ignored by git clean -f -d -x' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "untracked.txt\0"
+		printf "dir1/\0"
+		printf "dir3/\0"
+	EOF
+	mkdir -p dir3 &&
+	touch dir3/untracked.txt &&
+	git clean -f -d -x &&
+	test ! -f untracked.txt &&
+	test -d dir1 &&
+	test -f dir1/file1.txt &&
+	test -f dir1/file2.txt &&
+	test ! -f dir1/untracked.txt &&
+	test -f dir2/file1.txt &&
+	test -f dir2/file2.txt &&
+	test -f dir2/untracked.txt &&
+	test ! -d dir3 &&
+	test ! -f dir3/untracked.txt
+'
+
+test_expect_success 'verify folder entries include all files' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/\0"
+	EOF
+	mkdir -p dir1/dir2 &&
+	touch dir1/a &&
+	touch dir1/b &&
+	touch dir1/dir2/a &&
+	touch dir1/dir2/b &&
+	git status -su > actual &&
+	cat > expected <<-\EOF &&
+		?? dir1/a
+		?? dir1/b
+		?? dir1/dir2/a
+		?? dir1/dir2/b
+		?? dir1/untracked.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'verify case insensitivity of virtual file system entries' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/a\0"
+		printf "Dir1/Dir2/a\0"
+		printf "DIR2/\0"
+	EOF
+	mkdir -p dir1/dir2 &&
+	touch dir1/a &&
+	touch dir1/b &&
+	touch dir1/dir2/a &&
+	touch dir1/dir2/b &&
+	git -c core.ignorecase=false status -su > actual &&
+	cat > expected <<-\EOF &&
+		?? dir1/a
+	EOF
+	test_cmp expected actual &&
+	git -c core.ignorecase=true status -su > actual &&
+	cat > expected <<-\EOF &&
+		?? dir1/a
+		?? dir1/dir2/a
+		?? dir2/untracked.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'on file created' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/file3.txt\0"
+	EOF
+	touch dir1/file3.txt &&
+	git add . &&
+	git ls-files -v > actual &&
+	cat > expected <<-\EOF &&
+		S dir1/file1.txt
+		S dir1/file2.txt
+		H dir1/file3.txt
+		S dir2/file1.txt
+		S dir2/file2.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'on file renamed' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/file1.txt\0"
+		printf "dir1/file3.txt\0"
+	EOF
+	mv dir1/file1.txt dir1/file3.txt &&
+	git status -su > actual &&
+	cat > expected <<-\EOF &&
+		 D dir1/file1.txt
+		?? dir1/file3.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'on file deleted' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/file1.txt\0"
+	EOF
+	rm dir1/file1.txt &&
+	git status -su > actual &&
+	cat > expected <<-\EOF &&
+		 D dir1/file1.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'on file overwritten' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/file1.txt\0"
+	EOF
+	echo "overwritten" > dir1/file1.txt &&
+	git status -su > actual &&
+	cat > expected <<-\EOF &&
+		 M dir1/file1.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_expect_success 'on folder created' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir1/dir1/\0"
+	EOF
+	mkdir -p dir1/dir1 &&
+	git status -su > actual &&
+	cat > expected <<-\EOF &&
+	EOF
+	test_cmp expected actual &&
+	git clean -fd &&
+	test ! -d "/dir1/dir1"
+'
+
+test_expect_success 'on folder renamed' '
+	clean_repo &&
+	write_script .git/hooks/virtualfilesystem <<-\EOF &&
+		printf "dir3/\0"
+		printf "dir1/file1.txt\0"
+		printf "dir1/file2.txt\0"
+		printf "dir3/file1.txt\0"
+		printf "dir3/file2.txt\0"
+	EOF
+	mv dir1 dir3 &&
+	git status -su > actual &&
+	cat > expected <<-\EOF &&
+		 D dir1/file1.txt
+		 D dir1/file2.txt
+		?? dir3/file1.txt
+		?? dir3/file2.txt
+		?? dir3/untracked.txt
+	EOF
+	test_cmp expected actual
+'
+
+test_done
diff --git a/t/t1092/virtualfilesystem b/t/t1092/virtualfilesystem
new file mode 100755
index 0000000000..3f29c5f91f
--- /dev/null
+++ b/t/t1092/virtualfilesystem
@@ -0,0 +1,23 @@ 
+#!/bin/sh
+#
+# An test hook script to integrate with git to test the virtual file system.
+#
+# The hook is passed a version (currently 1) and outputs to stdout all files
+# that git should paid attention to.  Paths must be relative to the root of
+# the working tree and separated by a single NUL.
+#
+
+if test "$#" -ne 1
+then
+	echo "$0: exactly 1 argument expected" >&2
+	exit 2
+fi
+
+if test "$1" != 1
+then
+	echo "Unsupported core.virtualfilesystem hook version." >&2
+	exit 1
+fi
+
+#find -type f -printf '%P\0'
+find -type d -printf '%P/\0'
diff --git a/unpack-trees.c b/unpack-trees.c
index 7570df481b..ee3cda2e94 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -18,6 +18,7 @@ 
 #include "fsmonitor.h"
 #include "object-store.h"
 #include "fetch-object.h"
+#include "virtualfilesystem.h"
 
 /*
  * Error messages expected by scripts out of plumbing commands such as
@@ -1363,6 +1364,14 @@  static int clear_ce_flags_1(struct index_state *istate,
 			continue;
 		}
 
+		/* if it's not in the virtual file system, exit early */
+		if (core_virtualfilesystem) {
+			if (is_included_in_virtualfilesystem(ce->name, ce->ce_namelen) > 0)
+				ce->ce_flags &= ~clear_mask;
+			cache++;
+			continue;
+		}
+
 		if (prefix->len && strncmp(ce->name, prefix->buf, prefix->len))
 			break;
 
@@ -1481,12 +1490,16 @@  int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout) {
-		char *sparse = git_pathdup("info/sparse-checkout");
-		if (add_excludes_from_file_to_list(sparse, "", 0, &el, NULL) < 0)
-			o->skip_sparse_checkout = 1;
-		else
+		if (core_virtualfilesystem) {
 			o->el = &el;
-		free(sparse);
+		} else {
+			char *sparse = git_pathdup("info/sparse-checkout");
+			if (add_excludes_from_file_to_list(sparse, "", 0, &el, NULL) < 0)
+				o->skip_sparse_checkout = 1;
+			else
+				o->el = &el;
+			free(sparse);
+		}
 	}
 
 	memset(&o->result, 0, sizeof(o->result));
diff --git a/virtualfilesystem.c b/virtualfilesystem.c
new file mode 100644
index 0000000000..0d45803dbc
--- /dev/null
+++ b/virtualfilesystem.c
@@ -0,0 +1,308 @@ 
+#include "cache.h"
+#include "config.h"
+#include "dir.h"
+#include "hashmap.h"
+#include "run-command.h"
+#include "virtualfilesystem.h"
+
+#define HOOK_INTERFACE_VERSION	(1)
+
+static struct strbuf virtual_filesystem_data = STRBUF_INIT;
+static struct hashmap virtual_filesystem_hashmap;
+static struct hashmap parent_directory_hashmap;
+
+struct virtualfilesystem {
+	struct hashmap_entry ent; /* must be the first member! */
+	const char *pattern;
+	int patternlen;
+};
+
+static unsigned int(*vfshash)(const void *buf, size_t len);
+static int(*vfscmp)(const char *a, const char *b, size_t len);
+
+static int vfs_hashmap_cmp(const void *unused_cmp_data,
+	const void *a, const void *b, const void *key)
+{
+	const struct virtualfilesystem *vfs1 = a;
+	const struct virtualfilesystem *vfs2 = b;
+
+	return vfscmp(vfs1->pattern, vfs2->pattern, vfs1->patternlen);
+}
+
+static void get_virtual_filesystem_data(struct strbuf *vfs_data)
+{
+	struct child_process cp = CHILD_PROCESS_INIT;
+	int err;
+
+	strbuf_init(vfs_data, 0);
+
+	argv_array_push(&cp.args, core_virtualfilesystem);
+	argv_array_pushf(&cp.args, "%d", HOOK_INTERFACE_VERSION);
+	cp.use_shell = 1;
+	cp.dir = get_git_work_tree();
+
+	err = capture_command(&cp, vfs_data, 1024);
+	if (err)
+		die("unable to load virtual file system");
+}
+
+static int check_includes_hashmap(struct hashmap *map, const char *pattern, int patternlen)
+{
+	struct strbuf sb = STRBUF_INIT;
+	struct virtualfilesystem vfs;
+	char *slash;
+
+	/* Check straight mapping */
+	strbuf_reset(&sb);
+	strbuf_add(&sb, pattern, patternlen);
+	vfs.pattern = sb.buf;
+	vfs.patternlen = sb.len;
+	hashmap_entry_init(&vfs, vfshash(vfs.pattern, vfs.patternlen));
+	if (hashmap_get(map, &vfs, NULL)) {
+		strbuf_release(&sb);
+		return 1;
+	}
+
+	/*
+	 * Check to see if it matches a directory or any path
+	 * underneath it.  In other words, 'a/b/foo.txt' will match
+	 * '/', 'a/', and 'a/b/'.
+	 */
+	slash = strchr(sb.buf, '/');
+	while (slash) {
+		vfs.pattern = sb.buf;
+		vfs.patternlen = slash - sb.buf + 1;
+		hashmap_entry_init(&vfs, vfshash(vfs.pattern, vfs.patternlen));
+		if (hashmap_get(map, &vfs, NULL)) {
+			strbuf_release(&sb);
+			return 1;
+		}
+		slash = strchr(slash + 1, '/');
+	}
+
+	strbuf_release(&sb);
+	return 0;
+}
+
+static void includes_hashmap_add(struct hashmap *map, const char *pattern, const int patternlen)
+{
+	struct virtualfilesystem *vfs;
+
+	vfs = xmalloc(sizeof(struct virtualfilesystem));
+	vfs->pattern = pattern;
+	vfs->patternlen = patternlen;
+	hashmap_entry_init(vfs, vfshash(vfs->pattern, vfs->patternlen));
+	hashmap_add(map, vfs);
+}
+
+static void initialize_includes_hashmap(struct hashmap *map, struct strbuf *vfs_data)
+{
+	char *buf, *entry;
+	size_t len;
+	int i;
+
+	/*
+	 * Build a hashmap of the virtual file system data we can use to look
+	 * for cache entry matches quickly
+	 */
+	vfshash = ignore_case ? memihash : memhash;
+	vfscmp = ignore_case ? strncasecmp : strncmp;
+	hashmap_init(map, vfs_hashmap_cmp, NULL, 0);
+
+	entry = buf = vfs_data->buf;
+	len = vfs_data->len;
+	for (i = 0; i < len; i++) {
+		if (buf[i] == '\0') {
+			includes_hashmap_add(map, entry, buf + i - entry);
+			entry = buf + i + 1;
+		}
+	}
+}
+
+/*
+ * Return 1 if the requested item is found in the virtual file system,
+ * 0 for not found and -1 for undecided.
+ */
+int is_included_in_virtualfilesystem(const char *pathname, int pathlen)
+{
+	if (!core_virtualfilesystem)
+		return -1;
+
+	if (!virtual_filesystem_hashmap.tablesize && virtual_filesystem_data.len)
+		initialize_includes_hashmap(&virtual_filesystem_hashmap, &virtual_filesystem_data);
+	if (!virtual_filesystem_hashmap.tablesize)
+		return -1;
+
+	return check_includes_hashmap(&virtual_filesystem_hashmap, pathname, pathlen);
+}
+
+static void parent_directory_hashmap_add(struct hashmap *map, const char *pattern, const int patternlen)
+{
+	char *slash;
+	struct virtualfilesystem *vfs;
+
+	/*
+	 * Add any directories leading up to the file as the excludes logic
+	 * needs to match directories leading up to the files as well. Detect
+	 * and prevent unnecessary duplicate entries which will be common.
+	 */
+	if (patternlen > 1) {
+		slash = strchr(pattern + 1, '/');
+		while (slash) {
+			vfs = xmalloc(sizeof(struct virtualfilesystem));
+			vfs->pattern = pattern;
+			vfs->patternlen = slash - pattern + 1;
+			hashmap_entry_init(vfs, vfshash(vfs->pattern, vfs->patternlen));
+			if (hashmap_get(map, vfs, NULL))
+				free(vfs);
+			else
+				hashmap_add(map, vfs);
+			slash = strchr(slash + 1, '/');
+		}
+	}
+}
+
+static void initialize_parent_directory_hashmap(struct hashmap *map, struct strbuf *vfs_data)
+{
+	char *buf, *entry;
+	size_t len;
+	int i;
+
+	/*
+	 * Build a hashmap of the parent directories contained in the virtual
+	 * file system data we can use to look for matches quickly
+	 */
+	vfshash = ignore_case ? memihash : memhash;
+	vfscmp = ignore_case ? strncasecmp : strncmp;
+	hashmap_init(map, vfs_hashmap_cmp, NULL, 0);
+
+	entry = buf = vfs_data->buf;
+	len = vfs_data->len;
+	for (i = 0; i < len; i++) {
+		if (buf[i] == '\0') {
+			parent_directory_hashmap_add(map, entry, buf + i - entry);
+			entry = buf + i + 1;
+		}
+	}
+}
+
+static int check_directory_hashmap(struct hashmap *map, const char *pathname, int pathlen)
+{
+	struct strbuf sb = STRBUF_INIT;
+	struct virtualfilesystem vfs;
+
+	/* Check for directory */
+	strbuf_reset(&sb);
+	strbuf_add(&sb, pathname, pathlen);
+	strbuf_addch(&sb, '/');
+	vfs.pattern = sb.buf;
+	vfs.patternlen = sb.len;
+	hashmap_entry_init(&vfs, vfshash(vfs.pattern, vfs.patternlen));
+	if (hashmap_get(map, &vfs, NULL)) {
+		strbuf_release(&sb);
+		return 0;
+	}
+
+	strbuf_release(&sb);
+	return 1;
+}
+
+/*
+ * Return 1 for exclude, 0 for include and -1 for undecided.
+ */
+int is_excluded_from_virtualfilesystem(const char *pathname, int pathlen, int dtype)
+{
+	if (!core_virtualfilesystem)
+		return -1;
+
+	if (dtype != DT_REG && dtype != DT_DIR && dtype != DT_LNK)
+		die(_("is_excluded_from_virtualfilesystem passed unhandled dtype"));
+
+	if (dtype == DT_REG || dtype == DT_LNK) {
+		int ret = is_included_in_virtualfilesystem(pathname, pathlen);
+		if (ret > 0)
+			return 0;
+		if (ret == 0)
+			return 1;
+		return ret;
+	}
+
+	if (dtype == DT_DIR) {
+		int ret = is_included_in_virtualfilesystem(pathname, pathlen);
+		if (ret > 0)
+			return 0;
+
+		if (!parent_directory_hashmap.tablesize && virtual_filesystem_data.len)
+			initialize_parent_directory_hashmap(&parent_directory_hashmap, &virtual_filesystem_data);
+		if (!parent_directory_hashmap.tablesize)
+			return -1;
+
+		return check_directory_hashmap(&parent_directory_hashmap, pathname, pathlen);
+	}
+
+	return -1;
+}
+
+/*
+ * Update the CE_SKIP_WORKTREE bits based on the virtual file system.
+ */
+void apply_virtualfilesystem(struct index_state *istate)
+{
+	char *buf, *entry;
+	int i;
+
+	if (!git_config_get_virtualfilesystem())
+		return;
+
+	if (!virtual_filesystem_data.len)
+		get_virtual_filesystem_data(&virtual_filesystem_data);
+
+	/* set CE_SKIP_WORKTREE bit on all entries */
+	for (i = 0; i < istate->cache_nr; i++)
+		istate->cache[i]->ce_flags |= CE_SKIP_WORKTREE;
+
+	/* clear CE_SKIP_WORKTREE bit for everything in the virtual file system */
+	entry = buf = virtual_filesystem_data.buf;
+	for (i = 0; i < virtual_filesystem_data.len; i++) {
+		if (buf[i] == '\0') {
+			int pos, len;
+
+			len = buf + i - entry;
+
+			/* look for a directory wild card (ie "dir1/") */
+			if (buf[i - 1] == '/') {
+				if (ignore_case)
+					adjust_dirname_case(istate, entry);
+				pos = index_name_pos(istate, entry, len - 1);
+				if (pos < 0) {
+					pos = -pos - 1;
+					while (pos < istate->cache_nr && !fspathncmp(istate->cache[pos]->name, entry, len)) {
+						istate->cache[pos]->ce_flags &= ~CE_SKIP_WORKTREE;
+						pos++;
+					}
+				}
+			} else {
+				if (ignore_case) {
+					struct cache_entry *ce = index_file_exists(istate, entry, len, ignore_case);
+					if (ce)
+						ce->ce_flags &= ~CE_SKIP_WORKTREE;
+				} else {
+					int pos = index_name_pos(istate, entry, len);
+					if (pos >= 0)
+						istate->cache[pos]->ce_flags &= ~CE_SKIP_WORKTREE;
+				}
+			}
+
+			entry += len + 1;
+		}
+	}
+}
+
+/*
+ * Free the virtual file system data structures.
+ */
+void free_virtualfilesystem(void) {
+	hashmap_free(&virtual_filesystem_hashmap, 1);
+	hashmap_free(&parent_directory_hashmap, 1);
+	strbuf_release(&virtual_filesystem_data);
+}
diff --git a/virtualfilesystem.h b/virtualfilesystem.h
new file mode 100644
index 0000000000..5e8c5b096d
--- /dev/null
+++ b/virtualfilesystem.h
@@ -0,0 +1,25 @@ 
+#ifndef VIRTUALFILESYSTEM_H
+#define VIRTUALFILESYSTEM_H
+
+/*
+ * Update the CE_SKIP_WORKTREE bits based on the virtual file system.
+ */
+void apply_virtualfilesystem(struct index_state *istate);
+
+/*
+ * Return 1 if the requested item is found in the virtual file system,
+ * 0 for not found and -1 for undecided.
+ */
+int is_included_in_virtualfilesystem(const char *pathname, int pathlen);
+
+/*
+ * Return 1 for exclude, 0 for include and -1 for undecided.
+ */
+int is_excluded_from_virtualfilesystem(const char *pathname, int pathlen, int dtype);
+
+/*
+ * Free the virtual file system data structures.
+ */
+void free_virtualfilesystem(void);
+
+#endif