diff mbox series

clone: support cloning of filtered bundles

Message ID pull.1644.git.git.1705231010118.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series clone: support cloning of filtered bundles | expand

Commit Message

Nikolay Edigaryev Jan. 14, 2024, 11:16 a.m. UTC
From: Nikolay Edigaryev <edigaryev@gmail.com>

f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
incredibly useful ability to create filtered bundles, which advances
the partial clone/promisor support in Git and allows for archiving
large repositories to object storages like S3 in bundles that are:

* easy to manage
  * bundle is just a single file, it's easier to guarantee atomic
    replacements in object storages like S3 and they are faster to
    fetch than a bare repository since there's only a single GET
    request involved
* incredibly tiny
  * no indexes (which may be more than 10 MB for some repositories)
    and other fluff, compared to cloning a bare repository
  * bundle can be filtered to only contain the tips of refs neccessary
    for e.g. code-analysis purposes

However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
bundle, 2022-03-09) the cloning of such bundles was disabled, with a
note that this behavior is not desired, and it the long-term this
should be possible.

The commit above states that it's not possible to have this at the
moment due to lack of remote and a repository-global config that
specifies an object filter, yet it's unclear why a remote-specific
config can't be used instead, which is what this change does.

Signed-off-by: Nikolay Edigaryev <edigaryev@gmail.com>
---
    clone: support cloning of filtered bundles
    
    f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
    incredibly useful ability to create filtered bundles, which advances the
    partial clone/promisor support in Git and allows for archiving large
    repositories to object storages like S3 in bundles that are:
    
     * easy to manage
       * bundle is just a single file, it's easier to guarantee atomic
         replacements in object storages like S3 and they are faster to
         fetch than a bare repository since there's only a single GET
         request involved
     * incredibly tiny
       * no indexes (which may be more than 10 MB for some repositories) and
         other fluff, compared to cloning a bare repository
       * bundle can be filtered to only contain the tips of refs neccessary
         for e.g. code-analysis purposes
    
    However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
    bundle, 2022-03-09) the cloning of such bundles was disabled, with a
    note that this behavior is not desired, and it the long-term this should
    be possible.
    
    The commit above states that it's not possible to have this at the
    moment due to lack of remote and a repository-global config that
    specifies an object filter, yet it's unclear why a remote-specific
    config can't be used instead, which is what this change does.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1644%2Fedigaryev%2Fclone-filtered-bundles-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1644/edigaryev/clone-filtered-bundles-v1
Pull-Request: https://github.com/git/git/pull/1644

 builtin/clone.c        | 13 +++++++++++--
 t/t6020-bundle-misc.sh | 13 +++----------
 2 files changed, 14 insertions(+), 12 deletions(-)


base-commit: 564d0252ca632e0264ed670534a51d18a689ef5d

Comments

Phillip Wood Jan. 14, 2024, 6 p.m. UTC | #1
Hi Nikolay

On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
> From: Nikolay Edigaryev <edigaryev@gmail.com>
> 
> f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
> incredibly useful ability to create filtered bundles, which advances
> the partial clone/promisor support in Git and allows for archiving
> large repositories to object storages like S3 in bundles that are:
> 
> * easy to manage
>    * bundle is just a single file, it's easier to guarantee atomic
>      replacements in object storages like S3 and they are faster to
>      fetch than a bare repository since there's only a single GET
>      request involved
> * incredibly tiny
>    * no indexes (which may be more than 10 MB for some repositories)
>      and other fluff, compared to cloning a bare repository
>    * bundle can be filtered to only contain the tips of refs neccessary
>      for e.g. code-analysis purposes
> 
> However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
> bundle, 2022-03-09) the cloning of such bundles was disabled, with a
> note that this behavior is not desired, and it the long-term this
> should be possible.
> 
> The commit above states that it's not possible to have this at the
> moment due to lack of remote and a repository-global config that
> specifies an object filter, yet it's unclear why a remote-specific
> config can't be used instead, which is what this change does.

As I understand it if you're cloning from a bundle file then then there 
is no remote so how can we set a remote-specific config?

I'm surprised that the proposed change does not require the user to pass 
"--filter" to "git clone" as I expected that we'd want to check that the 
filter on the command line was compatible with the filter used to create 
the bundle. Allowing "git clone" to create a partial clone without the 
user asking for it by passing the "--filter" option feels like is going 
to end up confusing users.

> +test_expect_success 'cloning from filtered bundle works' '
> +	git bundle create partial.bdl --all --filter=blob:none &&
> +	git clone --bare partial.bdl partial 2>err

The redirection hides any error message which will make debugging test 
failures harder. It would be nice to see this test check any config set 
when cloning and that git commands can run successfully in the repository.

Best Wishes

Phillip
Nikolay Edigaryev Jan. 14, 2024, 7:39 p.m. UTC | #2
Hello Phillip,

> As I understand it if you're cloning from a bundle file then then there
> is no remote so how can we set a remote-specific config?

There is a remote, for more details see df61c88979 (clone: also
configure url for bare clones, 2010-03-29), which has the following
code:

strbuf_addf(&key, "remote.%s.url", remote_name);
git_config_set(key.buf, repo);
strbuf_reset(&key);

You can verify this by creating a bundle on Git 2.43.0 with "git create
bundle bundle.bundle --all" and then cloning it with "git clone
--bare /path/to/bundle.bundle", in my case the following repo-wide
configuration file was created:

[core]
repositoryformatversion = 0
filemode = true
bare = true
ignorecase = true
precomposeunicode = true
[remote "origin"]
url = /Users/edi/src/cirrus-cli/cli.bundle

> I'm surprised that the proposed change does not require the user to pass
> "--filter" to "git clone" as I expected that we'd want to check that the
> filter on the command line was compatible with the filter used to create
> the bundle. Allowing "git clone" to create a partial clone without the
> user asking for it by passing the "--filter" option feels like is going
> to end up confusing users.

Note that currently, when you clone a normal non-filtered bundle with a
'--filter' argument specified, no filtering will take place and no error
will be thrown. "promisor = true" and "partialclonefilter = ..." options
will be set in the repo config, but no .promisor file will be created.
This is even more confusing IMO, but that's how it currently on
Git 2.43.0.

You have a good point, but I feel like completely preventing cloning of
filtered bundles and requiring a '--filter' argument is very taxing. If
you've already specified a '--filter' when creating a bundle (and thus
your intent to use partially cloned data), why do it multiple times?

What I propose as an alternative here is to act based on the user's
intent when cloning:

* when the user specifies no '--filter' argument, do nothing special,
   allow cloning both types of bundles: normal and filtered (with the
   logic from this patch)

* when the user does specify a '--filter' argument, either:
  * throw an error explaining that filtering of filtered bundles is not
    supported
  * or compare the user's filter specification and the one that is
    in the bundle and only throw an error if they mismatch

Let me know what you think about this (and perhaps you have a more
concrete example in mind where this will have negative consequences)
and I'll be happy to do a next iteration.


On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>
> Hi Nikolay
>
> On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
> > From: Nikolay Edigaryev <edigaryev@gmail.com>
> >
> > f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
> > incredibly useful ability to create filtered bundles, which advances
> > the partial clone/promisor support in Git and allows for archiving
> > large repositories to object storages like S3 in bundles that are:
> >
> > * easy to manage
> >    * bundle is just a single file, it's easier to guarantee atomic
> >      replacements in object storages like S3 and they are faster to
> >      fetch than a bare repository since there's only a single GET
> >      request involved
> > * incredibly tiny
> >    * no indexes (which may be more than 10 MB for some repositories)
> >      and other fluff, compared to cloning a bare repository
> >    * bundle can be filtered to only contain the tips of refs neccessary
> >      for e.g. code-analysis purposes
> >
> > However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
> > bundle, 2022-03-09) the cloning of such bundles was disabled, with a
> > note that this behavior is not desired, and it the long-term this
> > should be possible.
> >
> > The commit above states that it's not possible to have this at the
> > moment due to lack of remote and a repository-global config that
> > specifies an object filter, yet it's unclear why a remote-specific
> > config can't be used instead, which is what this change does.
>
> As I understand it if you're cloning from a bundle file then then there
> is no remote so how can we set a remote-specific config?
>
> I'm surprised that the proposed change does not require the user to pass
> "--filter" to "git clone" as I expected that we'd want to check that the
> filter on the command line was compatible with the filter used to create
> the bundle. Allowing "git clone" to create a partial clone without the
> user asking for it by passing the "--filter" option feels like is going
> to end up confusing users.
>
> > +test_expect_success 'cloning from filtered bundle works' '
> > +     git bundle create partial.bdl --all --filter=blob:none &&
> > +     git clone --bare partial.bdl partial 2>err
>
> The redirection hides any error message which will make debugging test
> failures harder. It would be nice to see this test check any config set
> when cloning and that git commands can run successfully in the repository.
>
> Best Wishes
>
> Phillip
Nikolay Edigaryev Jan. 14, 2024, 9:26 p.m. UTC | #3
> Note that currently, when you clone a normal non-filtered bundle with a
> '--filter' argument specified, no filtering will take place and no error
> will be thrown. "promisor = true" and "partialclonefilter = ..." options
> will be set in the repo config, but no .promisor file will be created.
> This is even more confusing IMO, but that's how it currently on
> Git 2.43.0.

I was wrong about this one: the .promisor pack file actually gets created.

And the filtering is not being done because the "uploadpack.allowFilter"
global option is not enabled by default.

Unfortunately the only way to figure this out is to run Git with
GIT_TRACE=2, which shows:

warning: filtering not recognized by server, ignoring

So please feel free to skip this part from the consideration.


On Sun, Jan 14, 2024 at 11:39 PM Nikolay Edigaryev <edigaryev@gmail.com> wrote:
>
> Hello Phillip,
>
> > As I understand it if you're cloning from a bundle file then then there
> > is no remote so how can we set a remote-specific config?
>
> There is a remote, for more details see df61c88979 (clone: also
> configure url for bare clones, 2010-03-29), which has the following
> code:
>
> strbuf_addf(&key, "remote.%s.url", remote_name);
> git_config_set(key.buf, repo);
> strbuf_reset(&key);
>
> You can verify this by creating a bundle on Git 2.43.0 with "git create
> bundle bundle.bundle --all" and then cloning it with "git clone
> --bare /path/to/bundle.bundle", in my case the following repo-wide
> configuration file was created:
>
> [core]
> repositoryformatversion = 0
> filemode = true
> bare = true
> ignorecase = true
> precomposeunicode = true
> [remote "origin"]
> url = /Users/edi/src/cirrus-cli/cli.bundle
>
> > I'm surprised that the proposed change does not require the user to pass
> > "--filter" to "git clone" as I expected that we'd want to check that the
> > filter on the command line was compatible with the filter used to create
> > the bundle. Allowing "git clone" to create a partial clone without the
> > user asking for it by passing the "--filter" option feels like is going
> > to end up confusing users.
>
> Note that currently, when you clone a normal non-filtered bundle with a
> '--filter' argument specified, no filtering will take place and no error
> will be thrown. "promisor = true" and "partialclonefilter = ..." options
> will be set in the repo config, but no .promisor file will be created.
> This is even more confusing IMO, but that's how it currently on
> Git 2.43.0.
>
> You have a good point, but I feel like completely preventing cloning of
> filtered bundles and requiring a '--filter' argument is very taxing. If
> you've already specified a '--filter' when creating a bundle (and thus
> your intent to use partially cloned data), why do it multiple times?
>
> What I propose as an alternative here is to act based on the user's
> intent when cloning:
>
> * when the user specifies no '--filter' argument, do nothing special,
>    allow cloning both types of bundles: normal and filtered (with the
>    logic from this patch)
>
> * when the user does specify a '--filter' argument, either:
>   * throw an error explaining that filtering of filtered bundles is not
>     supported
>   * or compare the user's filter specification and the one that is
>     in the bundle and only throw an error if they mismatch
>
> Let me know what you think about this (and perhaps you have a more
> concrete example in mind where this will have negative consequences)
> and I'll be happy to do a next iteration.
>
>
> On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
> >
> > Hi Nikolay
> >
> > On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
> > > From: Nikolay Edigaryev <edigaryev@gmail.com>
> > >
> > > f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
> > > incredibly useful ability to create filtered bundles, which advances
> > > the partial clone/promisor support in Git and allows for archiving
> > > large repositories to object storages like S3 in bundles that are:
> > >
> > > * easy to manage
> > >    * bundle is just a single file, it's easier to guarantee atomic
> > >      replacements in object storages like S3 and they are faster to
> > >      fetch than a bare repository since there's only a single GET
> > >      request involved
> > > * incredibly tiny
> > >    * no indexes (which may be more than 10 MB for some repositories)
> > >      and other fluff, compared to cloning a bare repository
> > >    * bundle can be filtered to only contain the tips of refs neccessary
> > >      for e.g. code-analysis purposes
> > >
> > > However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
> > > bundle, 2022-03-09) the cloning of such bundles was disabled, with a
> > > note that this behavior is not desired, and it the long-term this
> > > should be possible.
> > >
> > > The commit above states that it's not possible to have this at the
> > > moment due to lack of remote and a repository-global config that
> > > specifies an object filter, yet it's unclear why a remote-specific
> > > config can't be used instead, which is what this change does.
> >
> > As I understand it if you're cloning from a bundle file then then there
> > is no remote so how can we set a remote-specific config?
> >
> > I'm surprised that the proposed change does not require the user to pass
> > "--filter" to "git clone" as I expected that we'd want to check that the
> > filter on the command line was compatible with the filter used to create
> > the bundle. Allowing "git clone" to create a partial clone without the
> > user asking for it by passing the "--filter" option feels like is going
> > to end up confusing users.
> >
> > > +test_expect_success 'cloning from filtered bundle works' '
> > > +     git bundle create partial.bdl --all --filter=blob:none &&
> > > +     git clone --bare partial.bdl partial 2>err
> >
> > The redirection hides any error message which will make debugging test
> > failures harder. It would be nice to see this test check any config set
> > when cloning and that git commands can run successfully in the repository.
> >
> > Best Wishes
> >
> > Phillip
Junio C Hamano Jan. 15, 2024, 1:13 a.m. UTC | #4
"Nikolay Edigaryev via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> diff --git a/builtin/clone.c b/builtin/clone.c
> index c6357af9498..4b3fedf78ed 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -1227,9 +1227,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>  
>  		if (fd > 0)
>  			close(fd);
> +
> +		if (has_filter) {
> +			strbuf_addf(&key, "remote.%s.promisor", remote_name);
> +			git_config_set(key.buf, "true");
> +			strbuf_reset(&key);
> +
> +			strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
> +			git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
> +			strbuf_reset(&key);
> +		}
> +

> -# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
> -# bundle, but that requires a change to promisor/filter config options.

Sorry, but this "should be able to" does not make sense to me in the
first place.

I can understand how an operation to create a narrow clone of an
unfiltered bundle and then prepare the resulting repository for
future "fattening" by naming the unfiltered bundle file a
"promisor", and allow the user to lazily fetch objects that have
initially been filtered out of the bundle lazily.

But a bundle that were created with objects _omitted_ already?  It
obviously cannot "promise" to deliber any objects that ought to be
reachable from the objects contained in it, so setting the bundle
file as promisor in the configuration does not help the resulting
repository.  Those missing objects must be obtained from somewhere
other than that original "filtered" bundle, and if that source of
objects that can promise all the objects that are ought to be
reachable were specified as the promisor, it would make sense.  But
the source of this clone operation, i.e. the bundle file that is
pointed at by "remote.$remote_name.url", cannot be that promisor.
Junio C Hamano Jan. 15, 2024, 2:09 a.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

> "Nikolay Edigaryev via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
>> diff --git a/builtin/clone.c b/builtin/clone.c
>> index c6357af9498..4b3fedf78ed 100644
>> --- a/builtin/clone.c
>> +++ b/builtin/clone.c
>> @@ -1227,9 +1227,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>>  
>>  		if (fd > 0)
>>  			close(fd);
>> +
>> +		if (has_filter) {
>> +			strbuf_addf(&key, "remote.%s.promisor", remote_name);
>> +			git_config_set(key.buf, "true");
>> +			strbuf_reset(&key);
>> +
>> +			strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
>> +			git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
>> +			strbuf_reset(&key);
>> +		}
>> +
>
>> -# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
>> -# bundle, but that requires a change to promisor/filter config options.
> ...
> But a bundle that were created with objects _omitted_ already?
> ... the source of this clone operation, i.e. the bundle file that is
> pointed at by "remote.$remote_name.url", cannot be that promisor.

Extending the above a bit, one important way a bundle is used is as
a medium for sneaker-net.  Instead of making a full clone over the
network, if you can create a bundle that records all objects and all
refs out of the source repository and then unbundle it in a
different place to create a repository, you can tweak the resulting
repository by either adding a separete remote or changing the
remote.origin.url so that your subsequent fetch goes over the
network to the repository you took the initial bundle from.

The "tweak the resulting repository" part however MUST be done
manually with the current system.  If we can optionally record the
publically reachable URL of the source repository when we create a
bundle file, and "git clone" on the receiving side can read the URL
out of the bundle and act on it (e.g., show it to the user and offer
to record it as remote.origin.url in the resulting repository---I do
not think it is wise to do this silently without letting the user
know from security's point of view), then the use of bundle files as
a medium for sneaker-netting will become even easier.

And once that is done, perhaps allowing a filtered bundle to act as
a sneaker-net medium to simulate an initial filtered clone would
make sense.  The promisor as well as the origin will be the network
reachable URL and subsequent fetches (both deliberate ones via "git
fetch" as well as lazy on-demand ones that backfills missing objects
via the "promisor" access) would become possible.

But without such a change to the bundle file format, allowing
"clone" to finish and pretend the resulting repository is usable is
somewhat irresponsible to the users.  The on-demand lazy fetch would
fail after this code cloned from such a filtered bundle, no?
Phillip Wood Jan. 15, 2024, 10:18 a.m. UTC | #6
Hi Nikolay

On 14/01/2024 19:39, Nikolay Edigaryev wrote:
> Hello Phillip,
> 
>> As I understand it if you're cloning from a bundle file then then there
>> is no remote so how can we set a remote-specific config?
> 
> There is a remote, for more details see df61c88979 (clone: also
> configure url for bare clones, 2010-03-29), which has the following
> code:
> 
> strbuf_addf(&key, "remote.%s.url", remote_name);
> git_config_set(key.buf, repo);
> strbuf_reset(&key);
> 
> You can verify this by creating a bundle on Git 2.43.0 with "git create
> bundle bundle.bundle --all" and then cloning it with "git clone
> --bare /path/to/bundle.bundle", in my case the following repo-wide
> configuration file was created:
> 
> [core]
> repositoryformatversion = 0
> filemode = true
> bare = true
> ignorecase = true
> precomposeunicode = true
> [remote "origin"]
> url = /Users/edi/src/cirrus-cli/cli.bundle

Oh, thanks for clarifying that I didn't realize we set "origin" to point 
to the bundle. That means this patch creates a promisor remote config 
pointing to a bundle that does not contain the missing objects. As Junio 
said that doesn't make much sense to me as the point of the promisor 
config is to allow git to lazily fetch the missing objects.

Best Wishes

Phillip

>> I'm surprised that the proposed change does not require the user to pass
>> "--filter" to "git clone" as I expected that we'd want to check that the
>> filter on the command line was compatible with the filter used to create
>> the bundle. Allowing "git clone" to create a partial clone without the
>> user asking for it by passing the "--filter" option feels like is going
>> to end up confusing users.
> 
> Note that currently, when you clone a normal non-filtered bundle with a
> '--filter' argument specified, no filtering will take place and no error
> will be thrown. "promisor = true" and "partialclonefilter = ..." options
> will be set in the repo config, but no .promisor file will be created.
> This is even more confusing IMO, but that's how it currently on
> Git 2.43.0.
> 
> You have a good point, but I feel like completely preventing cloning of
> filtered bundles and requiring a '--filter' argument is very taxing. If
> you've already specified a '--filter' when creating a bundle (and thus
> your intent to use partially cloned data), why do it multiple times?
> 
> What I propose as an alternative here is to act based on the user's
> intent when cloning:
> 
> * when the user specifies no '--filter' argument, do nothing special,
>     allow cloning both types of bundles: normal and filtered (with the
>     logic from this patch)
> 
> * when the user does specify a '--filter' argument, either:
>    * throw an error explaining that filtering of filtered bundles is not
>      supported
>    * or compare the user's filter specification and the one that is
>      in the bundle and only throw an error if they mismatch
> 
> Let me know what you think about this (and perhaps you have a more
> concrete example in mind where this will have negative consequences)
> and I'll be happy to do a next iteration.
> 
> 
> On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>
>> Hi Nikolay
>>
>> On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
>>> From: Nikolay Edigaryev <edigaryev@gmail.com>
>>>
>>> f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
>>> incredibly useful ability to create filtered bundles, which advances
>>> the partial clone/promisor support in Git and allows for archiving
>>> large repositories to object storages like S3 in bundles that are:
>>>
>>> * easy to manage
>>>     * bundle is just a single file, it's easier to guarantee atomic
>>>       replacements in object storages like S3 and they are faster to
>>>       fetch than a bare repository since there's only a single GET
>>>       request involved
>>> * incredibly tiny
>>>     * no indexes (which may be more than 10 MB for some repositories)
>>>       and other fluff, compared to cloning a bare repository
>>>     * bundle can be filtered to only contain the tips of refs neccessary
>>>       for e.g. code-analysis purposes
>>>
>>> However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
>>> bundle, 2022-03-09) the cloning of such bundles was disabled, with a
>>> note that this behavior is not desired, and it the long-term this
>>> should be possible.
>>>
>>> The commit above states that it's not possible to have this at the
>>> moment due to lack of remote and a repository-global config that
>>> specifies an object filter, yet it's unclear why a remote-specific
>>> config can't be used instead, which is what this change does.
>>
>> As I understand it if you're cloning from a bundle file then then there
>> is no remote so how can we set a remote-specific config?
>>
>> I'm surprised that the proposed change does not require the user to pass
>> "--filter" to "git clone" as I expected that we'd want to check that the
>> filter on the command line was compatible with the filter used to create
>> the bundle. Allowing "git clone" to create a partial clone without the
>> user asking for it by passing the "--filter" option feels like is going
>> to end up confusing users.
>>
>>> +test_expect_success 'cloning from filtered bundle works' '
>>> +     git bundle create partial.bdl --all --filter=blob:none &&
>>> +     git clone --bare partial.bdl partial 2>err
>>
>> The redirection hides any error message which will make debugging test
>> failures harder. It would be nice to see this test check any config set
>> when cloning and that git commands can run successfully in the repository.
>>
>> Best Wishes
>>
>> Phillip
Phillip Wood Jan. 15, 2024, 10:35 a.m. UTC | #7
Hi Nikolay

On 14/01/2024 21:26, Nikolay Edigaryev wrote:
>> Note that currently, when you clone a normal non-filtered bundle with a
>> '--filter' argument specified, no filtering will take place and no error
>> will be thrown. "promisor = true" and "partialclonefilter = ..." options
>> will be set in the repo config, but no .promisor file will be created.
>> This is even more confusing IMO, but that's how it currently on
>> Git 2.43.0.
> 
> I was wrong about this one: the .promisor pack file actually gets created.
> 
> And the filtering is not being done because the "uploadpack.allowFilter"
> global option is not enabled by default.

That makes sense - if you try to make a partial clone from a remote that 
does not support object filtering we fallback to a full clone and print 
the warning you mention below.

> Unfortunately the only way to figure this out is to run Git with
> GIT_TRACE=2, which shows:

That seems strange, you should see the warning without having to set 
GIT_TRACE. I've certainly seen the warning in the past when trying to 
create a partial clone from a remote did not support them without me 
setting any special environment variables.

Best Wishes

Phillip

> warning: filtering not recognized by server, ignoring
> 
> So please feel free to skip this part from the consideration.
> 
> 
> On Sun, Jan 14, 2024 at 11:39 PM Nikolay Edigaryev <edigaryev@gmail.com> wrote:
>>
>> Hello Phillip,
>>
>>> As I understand it if you're cloning from a bundle file then then there
>>> is no remote so how can we set a remote-specific config?
>>
>> There is a remote, for more details see df61c88979 (clone: also
>> configure url for bare clones, 2010-03-29), which has the following
>> code:
>>
>> strbuf_addf(&key, "remote.%s.url", remote_name);
>> git_config_set(key.buf, repo);
>> strbuf_reset(&key);
>>
>> You can verify this by creating a bundle on Git 2.43.0 with "git create
>> bundle bundle.bundle --all" and then cloning it with "git clone
>> --bare /path/to/bundle.bundle", in my case the following repo-wide
>> configuration file was created:
>>
>> [core]
>> repositoryformatversion = 0
>> filemode = true
>> bare = true
>> ignorecase = true
>> precomposeunicode = true
>> [remote "origin"]
>> url = /Users/edi/src/cirrus-cli/cli.bundle
>>
>>> I'm surprised that the proposed change does not require the user to pass
>>> "--filter" to "git clone" as I expected that we'd want to check that the
>>> filter on the command line was compatible with the filter used to create
>>> the bundle. Allowing "git clone" to create a partial clone without the
>>> user asking for it by passing the "--filter" option feels like is going
>>> to end up confusing users.
>>
>> Note that currently, when you clone a normal non-filtered bundle with a
>> '--filter' argument specified, no filtering will take place and no error
>> will be thrown. "promisor = true" and "partialclonefilter = ..." options
>> will be set in the repo config, but no .promisor file will be created.
>> This is even more confusing IMO, but that's how it currently on
>> Git 2.43.0.
>>
>> You have a good point, but I feel like completely preventing cloning of
>> filtered bundles and requiring a '--filter' argument is very taxing. If
>> you've already specified a '--filter' when creating a bundle (and thus
>> your intent to use partially cloned data), why do it multiple times?
>>
>> What I propose as an alternative here is to act based on the user's
>> intent when cloning:
>>
>> * when the user specifies no '--filter' argument, do nothing special,
>>     allow cloning both types of bundles: normal and filtered (with the
>>     logic from this patch)
>>
>> * when the user does specify a '--filter' argument, either:
>>    * throw an error explaining that filtering of filtered bundles is not
>>      supported
>>    * or compare the user's filter specification and the one that is
>>      in the bundle and only throw an error if they mismatch
>>
>> Let me know what you think about this (and perhaps you have a more
>> concrete example in mind where this will have negative consequences)
>> and I'll be happy to do a next iteration.
>>
>>
>> On Sun, Jan 14, 2024 at 10:00 PM Phillip Wood <phillip.wood123@gmail.com> wrote:
>>>
>>> Hi Nikolay
>>>
>>> On 14/01/2024 11:16, Nikolay Edigaryev via GitGitGadget wrote:
>>>> From: Nikolay Edigaryev <edigaryev@gmail.com>
>>>>
>>>> f18b512bbb (bundle: create filtered bundles, 2022-03-09) introduced an
>>>> incredibly useful ability to create filtered bundles, which advances
>>>> the partial clone/promisor support in Git and allows for archiving
>>>> large repositories to object storages like S3 in bundles that are:
>>>>
>>>> * easy to manage
>>>>     * bundle is just a single file, it's easier to guarantee atomic
>>>>       replacements in object storages like S3 and they are faster to
>>>>       fetch than a bare repository since there's only a single GET
>>>>       request involved
>>>> * incredibly tiny
>>>>     * no indexes (which may be more than 10 MB for some repositories)
>>>>       and other fluff, compared to cloning a bare repository
>>>>     * bundle can be filtered to only contain the tips of refs neccessary
>>>>       for e.g. code-analysis purposes
>>>>
>>>> However, in 86fdd94d72 (clone: fail gracefully when cloning filtered
>>>> bundle, 2022-03-09) the cloning of such bundles was disabled, with a
>>>> note that this behavior is not desired, and it the long-term this
>>>> should be possible.
>>>>
>>>> The commit above states that it's not possible to have this at the
>>>> moment due to lack of remote and a repository-global config that
>>>> specifies an object filter, yet it's unclear why a remote-specific
>>>> config can't be used instead, which is what this change does.
>>>
>>> As I understand it if you're cloning from a bundle file then then there
>>> is no remote so how can we set a remote-specific config?
>>>
>>> I'm surprised that the proposed change does not require the user to pass
>>> "--filter" to "git clone" as I expected that we'd want to check that the
>>> filter on the command line was compatible with the filter used to create
>>> the bundle. Allowing "git clone" to create a partial clone without the
>>> user asking for it by passing the "--filter" option feels like is going
>>> to end up confusing users.
>>>
>>>> +test_expect_success 'cloning from filtered bundle works' '
>>>> +     git bundle create partial.bdl --all --filter=blob:none &&
>>>> +     git clone --bare partial.bdl partial 2>err
>>>
>>> The redirection hides any error message which will make debugging test
>>> failures harder. It would be nice to see this test check any config set
>>> when cloning and that git commands can run successfully in the repository.
>>>
>>> Best Wishes
>>>
>>> Phillip
Nikolay Edigaryev Jan. 16, 2024, 8:06 p.m. UTC | #8
Hello Junio and Phillip,

Thanks a lot for the explanations of how this is supposed to work. It
seems that to make this work properly, we'd need to:

(1) add an argument (or an option) to 'git bundle create', so that
    the user will be able to explicitly request the inclusion of a
    desired remote's URL

Without such mechanism in place data leak is possible, e.g. remote with
credentials hardcoded in it.

(2) extend the 'gitformat-bundle' to include 'url'

However, a remote can have multiple URLs and other remote-specific
options might be necessary to properly work with it.

(3) add an argument (or an option) to 'git clone', so that the user
    will be able to explicitly request the write of the URL contained
    in the bundle to the repository's config

Otherwise, it's insecure, e.g. someone might craft a bundle with a URL
that collects data from the user.

I don't want waste anyone's time on this anymore because I've toyed with
'git bundle' a bit more and realized that what I'm trying to accomplish
can be done the other way:

1. git init

2. git bundle unbundle <PATH> | <script that swaps hashes and refs in
   'git bundle unbundle output' and feeds them to 'git update-ref'>

Hopefully this discussion will be useful for people looking to
accomplish something similar to what I've described in the initial
message.

On Mon, Jan 15, 2024 at 6:09 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > "Nikolay Edigaryev via GitGitGadget" <gitgitgadget@gmail.com>
> > writes:
> >
> >> diff --git a/builtin/clone.c b/builtin/clone.c
> >> index c6357af9498..4b3fedf78ed 100644
> >> --- a/builtin/clone.c
> >> +++ b/builtin/clone.c
> >> @@ -1227,9 +1227,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
> >>
> >>              if (fd > 0)
> >>                      close(fd);
> >> +
> >> +            if (has_filter) {
> >> +                    strbuf_addf(&key, "remote.%s.promisor", remote_name);
> >> +                    git_config_set(key.buf, "true");
> >> +                    strbuf_reset(&key);
> >> +
> >> +                    strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
> >> +                    git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
> >> +                    strbuf_reset(&key);
> >> +            }
> >> +
> >
> >> -# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
> >> -# bundle, but that requires a change to promisor/filter config options.
> > ...
> > But a bundle that were created with objects _omitted_ already?
> > ... the source of this clone operation, i.e. the bundle file that is
> > pointed at by "remote.$remote_name.url", cannot be that promisor.
>
> Extending the above a bit, one important way a bundle is used is as
> a medium for sneaker-net.  Instead of making a full clone over the
> network, if you can create a bundle that records all objects and all
> refs out of the source repository and then unbundle it in a
> different place to create a repository, you can tweak the resulting
> repository by either adding a separete remote or changing the
> remote.origin.url so that your subsequent fetch goes over the
> network to the repository you took the initial bundle from.
>
> The "tweak the resulting repository" part however MUST be done
> manually with the current system.  If we can optionally record the
> publically reachable URL of the source repository when we create a
> bundle file, and "git clone" on the receiving side can read the URL
> out of the bundle and act on it (e.g., show it to the user and offer
> to record it as remote.origin.url in the resulting repository---I do
> not think it is wise to do this silently without letting the user
> know from security's point of view), then the use of bundle files as
> a medium for sneaker-netting will become even easier.
>
> And once that is done, perhaps allowing a filtered bundle to act as
> a sneaker-net medium to simulate an initial filtered clone would
> make sense.  The promisor as well as the origin will be the network
> reachable URL and subsequent fetches (both deliberate ones via "git
> fetch" as well as lazy on-demand ones that backfills missing objects
> via the "promisor" access) would become possible.
>
> But without such a change to the bundle file format, allowing
> "clone" to finish and pretend the resulting repository is usable is
> somewhat irresponsible to the users.  The on-demand lazy fetch would
> fail after this code cloned from such a filtered bundle, no?
diff mbox series

Patch

diff --git a/builtin/clone.c b/builtin/clone.c
index c6357af9498..4b3fedf78ed 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1227,9 +1227,18 @@  int cmd_clone(int argc, const char **argv, const char *prefix)
 
 		if (fd > 0)
 			close(fd);
+
+		if (has_filter) {
+			strbuf_addf(&key, "remote.%s.promisor", remote_name);
+			git_config_set(key.buf, "true");
+			strbuf_reset(&key);
+
+			strbuf_addf(&key, "remote.%s.partialclonefilter", remote_name);
+			git_config_set(key.buf, expand_list_objects_filter_spec(&header.filter));
+			strbuf_reset(&key);
+		}
+
 		bundle_header_release(&header);
-		if (has_filter)
-			die(_("cannot clone from filtered bundle"));
 	}
 
 	transport_set_option(transport, TRANS_OPT_KEEP, "yes");
diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index 3e6bcbf30cd..f449df00642 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -555,16 +555,9 @@  do
 	'
 done
 
-# NEEDSWORK: 'git clone --bare' should be able to clone from a filtered
-# bundle, but that requires a change to promisor/filter config options.
-# For now, we fail gracefully with a helpful error. This behavior can be
-# changed in the future to succeed as much as possible.
-test_expect_success 'cloning from filtered bundle has useful error' '
-	git bundle create partial.bdl \
-		--all \
-		--filter=blob:none &&
-	test_must_fail git clone --bare partial.bdl partial 2>err &&
-	grep "cannot clone from filtered bundle" err
+test_expect_success 'cloning from filtered bundle works' '
+	git bundle create partial.bdl --all --filter=blob:none &&
+	git clone --bare partial.bdl partial 2>err
 '
 
 test_expect_success 'verify catches unreachable, broken prerequisites' '