[v2,0/4,RFC] repack: add --filter=

Message ID: pull.1206.v2.git.git.1644372606.gitgitgadget@gmail.com

Message

Jean-Noël Avila via GitGitGadget Feb. 9, 2022, 2:10 a.m. UTC
This patch series makes partial clone more useful by making it possible to
run repack to remove objects from a repository (replacing them with promisor
objects). This is useful when we want to offload large blobs from a git
server onto another git server, or even use an http server through a remote
helper.

In [A], a --refilter option on fetch and fetch-pack is being discussed where
either a less restrictive or more restrictive filter can be used. In the
more restrictive case, the objects that already exist will not be deleted.
But, one can imagine that users might want the ability to delete objects
when they apply a more restrictive filter in order to save space, and this
patch series would also allow that.

There are a couple of things we need to adjust to make this possible. This
series has four parts.

 1. Allow --filter in pack-objects without --stdout
 2. Add a --filter flag for repack
 3. Allow missing promisor objects in upload-pack
 4. Tests that demonstrate the ability to offload objects onto an http
    remote
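
As a rough end-to-end illustration (the 800k threshold and the commands
roughly mirror the new t0410 test; the upload step stands in for whatever
mechanism actually copies the blobs to the other server):

    # 1. list the blobs the filter would drop, and upload them elsewhere
    git rev-list --all --objects --filter=blob:limit=800k \
        --filter-print-omitted | sed -n 's/^~//p' >large_blobs.txt
    # ... upload the listed objects via a remote helper of choice ...

    # 2. repack, dropping the filtered objects and marking the resulting
    #    packs as promisor packs
    git -c repack.writebitmaps=false repack -a -d --filter=blob:limit=800k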

cc: Christian Couder <christian.couder@gmail.com>
cc: Derrick Stolee <stolee@gmail.com>
cc: Robert Coup <robert@coup.net.nz>

[A] https://lore.kernel.org/git/pull.1138.git.1643730593.gitgitgadget@gmail.com/

John Cai (4):
  pack-objects: allow --filter without --stdout
  repack: add --filter=<filter-spec> option
  upload-pack: allow missing promisor objects
  tests for repack --filter mode

 Documentation/git-repack.txt   |   5 +
 builtin/pack-objects.c         |   2 -
 builtin/repack.c               |  22 +++--
 t/lib-httpd.sh                 |   2 +
 t/lib-httpd/apache.conf        |   8 ++
 t/lib-httpd/list.sh            |  43 +++++++++
 t/lib-httpd/upload.sh          |  46 +++++++++
 t/t0410-partial-clone.sh       |  81 ++++++++++++++++
 t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++
 t/t7700-repack.sh              |  20 ++++
 upload-pack.c                  |   5 +
 11 files changed, 395 insertions(+), 9 deletions(-)
 create mode 100644 t/lib-httpd/list.sh
 create mode 100644 t/lib-httpd/upload.sh
 create mode 100755 t/t0410/git-remote-testhttpgit


base-commit: 38062e73e009f27ea192d50481fcb5e7b0e9d6eb
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1206%2Fjohn-cai%2Fjc-repack-filter-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1206/john-cai/jc-repack-filter-v2
Pull-Request: https://github.com/git/git/pull/1206

Range-diff vs v1:

 1:  0eec9b117da = 1:  f43b76ca650 pack-objects: allow --filter without --stdout
 -:  ----------- > 2:  6e7c8410b8d repack: add --filter=<filter-spec> option
 -:  ----------- > 3:  40612b9663b upload-pack: allow missing promisor objects
 2:  a3166381572 ! 4:  d76faa1f16e repack: add --filter=<filter-spec> option
     @@ Metadata
      Author: John Cai <johncai86@gmail.com>
      
       ## Commit message ##
     -    repack: add --filter=<filter-spec> option
     +    tests for repack --filter mode
      
     -    Currently, repack does not work with partial clones. When repack is run
     -    on a partially cloned repository, it grabs all missing objects from
     -    promisor remotes. This also means that when gc is run for repository
     -    maintenance on a partially cloned repository, it will end up getting
     -    missing objects, which is not what we want.
     -
     -    In order to make repack work with partial clone, teach repack a new
     -    option --filter, which takes a <filter-spec> argument. repack will skip
     -    any objects that are matched by <filter-spec> similar to how the clone
     -    command will skip fetching certain objects.
     -
     -    The final goal of this feature, is to be able to store objects on a
     -    server other than the regular git server itself.
     +    This patch adds tests to test both repack --filter functionality in
     +    isolation (in t7700-repack.sh) as well as how it can be used to offload
     +    large blobs (in t0410-partial-clone.sh)
      
          There are several scripts added so we can test the process of using a
     -    remote helper to upload blobs to an http server:
     +    remote helper to upload blobs to an http server.
      
          - t/lib-httpd/list.sh lists blobs uploaded to the http server.
          - t/lib-httpd/upload.sh uploads blobs to the http server.
     @@ Commit message
          Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
          Signed-off-by: John Cai <johncai86@gmail.com>
      
     - ## Documentation/git-repack.txt ##
     -@@ Documentation/git-repack.txt: depth is 4095.
     - 	a larger and slower repository; see the discussion in
     - 	`pack.packSizeLimit`.
     - 
     -+--filter=<filter-spec>::
     -+	Omits certain objects (usually blobs) from the resulting
     -+	packfile. See linkgit:git-rev-list[1] for valid
     -+	`<filter-spec>` forms.
     -+
     - -b::
     - --write-bitmap-index::
     - 	Write a reachability bitmap index as part of the repack. This
     -
     - ## builtin/repack.c ##
     -@@ builtin/repack.c: struct pack_objects_args {
     - 	const char *depth;
     - 	const char *threads;
     - 	const char *max_pack_size;
     -+	const char *filter;
     - 	int no_reuse_delta;
     - 	int no_reuse_object;
     - 	int quiet;
     -@@ builtin/repack.c: static void prepare_pack_objects(struct child_process *cmd,
     - 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
     - 	if (args->max_pack_size)
     - 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
     -+	if (args->filter)
     -+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
     - 	if (args->no_reuse_delta)
     - 		strvec_pushf(&cmd->args, "--no-reuse-delta");
     - 	if (args->no_reuse_object)
     -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
     - 				N_("limits the maximum number of threads")),
     - 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
     - 				N_("maximum size of each packfile")),
     -+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
     -+				N_("object filtering")),
     - 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
     - 				N_("repack objects in packs marked with .keep")),
     - 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
     -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
     - 		if (line.len != the_hash_algo->hexsz)
     - 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
     - 		string_list_append(&names, line.buf);
     -+		if (po_args.filter) {
     -+			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
     -+							line.buf);
     -+			write_promisor_file(promisor_name, NULL, 0);
     -+		}
     - 	}
     - 	fclose(out);
     - 	ret = finish_command(&cmd);
     -
       ## t/lib-httpd.sh ##
      @@ t/lib-httpd.sh: prepare_httpd() {
       	install_script error-smart-http.sh
     @@ t/t0410-partial-clone.sh: test_expect_success 'fetching of missing objects from
      +	git -C server rev-list --objects --all --missing=print >objects &&
      +	grep "$sha" objects
      +'
     ++
     ++test_expect_success 'fetch does not cause server to fetch missing objects' '
     ++	rm -rf origin server client &&
     ++	test_create_repo origin &&
     ++	dd if=/dev/zero of=origin/file1 bs=801k count=1 &&
     ++	git -C origin add file1 &&
     ++	git -C origin commit -m "large blob" &&
     ++	sha="$(git -C origin rev-parse :file1)" &&
     ++	expected="?$(git -C origin rev-parse :file1)" &&
     ++	git clone --bare --no-local origin server &&
     ++	git -C server remote add httpremote "testhttpgit::${PWD}/server" &&
     ++	git -C server config remote.httpremote.promisor true &&
     ++	git -C server config --remove-section remote.origin &&
     ++	git -C server rev-list --all --objects --filter-print-omitted \
     ++		--filter=blob:limit=800k | perl -ne "print if s/^[~]//" \
     ++		>large_blobs.txt &&
     ++	upload_blobs_from_stdin server <large_blobs.txt &&
     ++	git -C server -c repack.writebitmaps=false repack -a -d \
     ++		--filter=blob:limit=800k &&
     ++	git -C server config uploadpack.allowmissingpromisor true &&
     ++	git clone -c remote.httpremote.url="testhttpgit::${PWD}/server" \
     ++	-c remote.httpremote.fetch='+refs/heads/*:refs/remotes/httpremote/*' \
     ++	-c remote.httpremote.promisor=true --bare --no-local \
     ++	--filter=blob:limit=800k server client &&
     ++	git -C client rev-list --objects --all --missing=print >client_objects &&
     ++	grep "$expected" client_objects &&
     ++	git -C server rev-list --objects --all --missing=print >server_objects &&
     ++	grep "$expected" server_objects
     ++'
      +
       # DO NOT add non-httpd-specific tests here, because the last part of this
       # test script is only executed when httpd is available and enabled.

Comments

Robert Coup Feb. 16, 2022, 3:39 p.m. UTC | #1
Hi John,

On Wed, 9 Feb 2022 at 02:41, John Cai via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This patch series makes partial clone more useful by making it possible to
> run repack to remove objects from a repository (replacing it with promisor
> objects). This is useful when we want to offload large blobs from a git
> server onto another git server, or even use an http server through a remote
> helper.
>
> In [A], a --refilter option on fetch and fetch-pack is being discussed where
> either a less restrictive or more restrictive filter can be used. In the
> more restrictive case, the objects that already exist will not be deleted.
> But, one can imagine that users might want the ability to delete objects
> when they apply a more restrictive filter in order to save space, and this
> patch series would also allow that.

This all makes sense to me, and the implementation is remarkably short -
gluing together capabilities that are already there, and writing tests.

*But*, running `repack --filter` drops objects from the object db. That
seems like a capability Git shouldn't idly expose without people
understanding the consequences - mostly that they really have another
copy elsewhere or they will lose data, and it won't necessarily be
obvious for a long time. Otherwise it is a footgun.

I don't know whether that is just around naming (--delete-filter /
--drop-filter / --expire-filter ?), and/or making the documentation very
explicit that this isn't so much "omitting certain objects from a
packfile" as irretrievably deleting objects.

Rob :)
John Cai Feb. 16, 2022, 9:07 p.m. UTC | #2
Hi Rob,

glad these two efforts dovetail nicely!

On 16 Feb 2022, at 10:39, Robert Coup wrote:

> Hi John,
>
> On Wed, 9 Feb 2022 at 02:41, John Cai via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> This patch series makes partial clone more useful by making it possible to
>> run repack to remove objects from a repository (replacing it with promisor
>> objects). This is useful when we want to offload large blobs from a git
>> server onto another git server, or even use an http server through a remote
>> helper.
>>
>> In [A], a --refilter option on fetch and fetch-pack is being discussed where
>> either a less restrictive or more restrictive filter can be used. In the
>> more restrictive case, the objects that already exist will not be deleted.
>> But, one can imagine that users might want the ability to delete objects
>> when they apply a more restrictive filter in order to save space, and this
>> patch series would also allow that.
>
> This all makes sense to me, and the implementation is remarkably short -
> gluing together capabilities that are already there, and writing tests.
>
> *But*, running `repack --filter` drops objects from the object db.
> That seems like
> a capability Git shouldn't idly expose without people understanding the
> consequences - mostly that they really have another copy elsewhere or they
> will lose data, and it won't necessarily be obvious for a long time. Otherwise
> it is a footgun.

Yes, great point. I think there was concern from Stolee around this as well.
>
> I don't know whether that is just around naming (--delete-filter /
> --drop-filter /
> --expire-filter ?), and/or making the documentation very explicit that
> this isn't so
> much "omitting certain objects from a packfile" as irretrievably
> deleting objects.

Yeah, making the name very clear (I kind of like --delete-filter) would certainly help.
Also, to have more protection we can either

1. add a config value that needs to be set to true for repack to remove
objects (repack.allowDestroyFilter).

2. --filter is dry-run by default and prints out objects that would have been removed,
and it has to be combined with another flag --destroy in order for it to actually remove
objects from the odb.
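
To illustrate option 2, the split might look something like this (the
--destroy flag is purely hypothetical, nothing implements it yet):

    $ git repack -a -d --filter=blob:limit=1m           # only report what would be removed
    $ git repack -a -d --filter=blob:limit=1m --destroy # actually remove it from the odb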

>
> Rob :)
Taylor Blau Feb. 21, 2022, 3:11 a.m. UTC | #3
On Wed, Feb 16, 2022 at 04:07:14PM -0500, John Cai wrote:
> > I don't know whether that is just around naming (--delete-filter /
> > --drop-filter /
> > --expire-filter ?), and/or making the documentation very explicit that
> > this isn't so
> > much "omitting certain objects from a packfile" as irretrievably
> > deleting objects.
>
> Yeah, making the name very clear (I kind of like --delete-filter) would certainly help.
> Also, to have more protection we can either
>
> 1. add a config value that needs to be set to true for repack to remove
> objects (repack.allowDestroyFilter).
>
> 2. --filter is dry-run by default and prints out objects that would have been removed,
> and it has to be combined with another flag --destroy in order for it to actually remove
> objects from the odb.

I share the same concern as Robert and Stolee do. But I think this issue
goes deeper than just naming.

Even if we called this `git repack --delete-filter` and only ran it with
`--i-know-what-im-doing` flag, we would still be leaving repository
corruption on the table, just making it marginally more difficult to
achieve.

I'm not familiar enough with the proposal to comment authoritatively,
but it seems like we should be verifying that there is a promisor remote
which promises any objects that we are about to filter out of the
repository.

I think that this is basically what `pack-objects`'s
`--missing=allow-promisor` does, though I don't think that's the right
tool for this job, either. Because pack-objects also knows the object
filter, by the time we are ready to construct a pack, we're traversing
the filtered list of objects.

So we don't even bother to call show_object (or, in this case,
builtin/pack-objects.c::show_object__ma_allow_promisor) on them.

So I wonder what your thoughts are on having pack-objects only allow an
object to get "filtered out" if a copy of it is promised by some
promisor remote. Alternatively, and perhaps a more straightforward
option might be to have `git repack` look at any objects that exist in a
pack we're about to delete, but don't exist in any of the packs we are
going to leave around, and make sure that any of those objects are
either unreachable or exist on a promisor remote.
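
(As a rough sketch of the "only in packs we are about to delete" part,
using existing plumbing and placeholder pack names:

    git show-index <.git/objects/pack/pack-OLD.idx  | awk '{print $2}' | sort >doomed
    git show-index <.git/objects/pack/pack-KEEP.idx | awk '{print $2}' | sort >kept
    comm -23 doomed kept    # must be unreachable or promised elsewhere

and repack would then have to verify the survivors against a promisor
remote.)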

But as it stands right now, I worry that this feature is too easily
misused and could result in unintended repository corruption.

I think verifying that any objects we're about to delete exist
somewhere should make this safer to use, though even then, I think we're
still open to a TOCTOU race whereby the promisor has the objects when
we're about to delete them (convincing Git that deleting those objects
is OK to do) but gets rid of them after objects have been deleted from
the local copy (leaving no copies of the object around).

So, I don't know exactly what the right path forward is. But I'm curious
to get your thoughts on the above.

Thanks,
Taylor
Robert Coup Feb. 21, 2022, 3:38 p.m. UTC | #4
Hi,

On Mon, 21 Feb 2022 at 03:11, Taylor Blau <me@ttaylorr.com> wrote:
>
> we would still be leaving repository
> corruption on the table, just making it marginally more difficult to
> achieve.

While reviewing John's patch I initially wondered if a better approach
might be something like `git repack -a -d --exclude-stdin`, passing a
list of specific objects to exclude from the new pack (sourced from
rev-list via a filter, etc). To me this seems like a less dangerous
approach, but my concern was it doesn't use the existing filter
capabilities of pack-objects, and we end up generating and passing
around a huge list of oids. And of course any mistakes in the list
generation aren't visible until it's too late.

I also wonder whether there's a race condition if the repository gets
updated? If you're moving large objects out in advance, then filtering
the remainder there's nothing to stop a new large object being pushed
between those two steps and getting dropped.

My other idea, which is growing on me, is whether repack could
generate two valid packs: one for the included objects via the filter
(as John's change does now), and one containing the filtered-out
objects. `git repack -a -d --split-filter=<filter>` A user could
then move/extract the second packfile to object storage, but there'd
be no way to *accidentally* corrupt the repository by using a bad
option. With this approach the race condition above goes away too.

    $ git repack -a -d -q --split-filter=blob:limit=1m
    pack-37b7443e3123549a2ddee31f616ae272c51cae90
    pack-10789d94fcd99ffe1403b63b167c181a9df493cd

First pack identifier being the objects that match the filter (ie:
commits/trees/blobs <1m), and the second pack identifier being the
objects that are excluded by the filter (blobs >1m).

An astute --i-know-what-im-doing reader could spot that you could just
delete the second packfile and achieve the same outcome as the current
proposed patch, subject to being confident the race condition hadn't
happened to you.

Thanks,
Rob :)
Taylor Blau Feb. 21, 2022, 5:57 p.m. UTC | #5
On Mon, Feb 21, 2022 at 03:38:03PM +0000, Robert Coup wrote:
> Hi,
>
> On Mon, 21 Feb 2022 at 03:11, Taylor Blau <me@ttaylorr.com> wrote:
> >
> > we would still be leaving repository
> > corruption on the table, just making it marginally more difficult to
> > achieve.
>
> While reviewing John's patch I initially wondered if a better approach
> might be something like `git repack -a -d --exclude-stdin`, passing a
> list of specific objects to exclude from the new pack (sourced from
> rev-list via a filter, etc). To me this seems like a less dangerous
> approach, but my concern was it doesn't use the existing filter
> capabilities of pack-objects, and we end up generating and passing
> around a huge list of oids. And of course any mistakes in the list
> generation aren't visible until it's too late.

Yeah; I think the most elegant approach would have pack-objects do as
much work as possible, and have repack be in charge of coordinating what
all the pack-objects invocation(s) have to do.

> I also wonder whether there's a race condition if the repository gets
> updated? If you're moving large objects out in advance, then filtering
> the remainder there's nothing to stop a new large object being pushed
> between those two steps and getting dropped.

Yeah, we will want to make sure that we're operating on a consistent
view of the repository. If this is all done in-process, it won't be a
problem since we'll capture an atomic snapshot of the reference states
once. If this is done across multiple processes, we'll need to make sure
we're passing around that snapshot where appropriate.

See the `--refs-snapshot`-related code in git-repack for when we write a
multi-pack bitmap for an example of the latter.

> My other idea, which is growing on me, is whether repack could
> generate two valid packs: one for the included objects via the filter
> (as John's change does now), and one containing the filtered-out
> objects. `git repack -a -d --split-filter=<filter>` Then a user could
> then move/extract the second packfile to object storage, but there'd
> be no way to *accidentally* corrupt the repository by using a bad
> option. With this approach the race condition above goes away too.
>
>     $ git repack -a -d -q --split-filter=blob:limit=1m
>     pack-37b7443e3123549a2ddee31f616ae272c51cae90
>     pack-10789d94fcd99ffe1403b63b167c181a9df493cd
>
> First pack identifier being the objects that match the filter (ie:
> commits/trees/blobs <1m), and the second pack identifier being the
> objects that are excluded by the filter (blobs >1m).

I like this idea quite a bit. We also have a lot of existing tools that
would make an implementation fairly lightweight, namely pack-objects'
`--stdin-packs` mode.

Using that would look something like first having `repack` generate the
filtered pack, remembering its name [1]. After that, we would run
`pack-objects` again, this time with `--stdin-packs`, where the positive
packs are the ones we're going to delete, and the negative pack(s)
is/are the filtered one generated in the last step.

The second invocation would leave us with a single pack which represents
all of the objects in packs we are about to delete, skipping any objects
that are in the filtered pack we just generated. In other words, it
would leave the repository with two packs: one with all of the objects
that met the filter criteria, and one with all objects that don't meet
the filter criteria.
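
A sketch of that sequence, with placeholder pack names and assuming
patch 1 of this series (so pack-objects accepts --filter without
--stdout):

    # pack 1: reachable objects that survive the filter
    git pack-objects --all --filter=blob:limit=1m \
        .git/objects/pack/pack-kept </dev/null

    # pack 2: everything in the old packs that is not in pack 1
    printf '%s\n' pack-OLD1.pack pack-OLD2.pack ^pack-KEPT.pack |
        git pack-objects --stdin-packs .git/objects/pack/pack-rest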

A client could then upload the "doesn't meet the filter criteria" pack
off elsewhere, and then delete it locally. (I'm assuming this last part
in particular is orchestrated by some other command, and we aren't
encouraging users to run "rm" inside of .git/objects/pack!)

> An astute --i-know-what-im-doing reader could spot that you could just
> delete the second packfile and achieve the same outcome as the current
> proposed patch, subject to being confident the race condition hadn't
> happened to you.

Yeah, and I think this goes to my joking remark in the last paragraph.
If we allow users to delete packs at will, all bets are off regardless
of how safely we generate those packs. But I think splitting the
repository into two packs and _then_ dealing with one of them separately
as opposed to deleting objects which don't meet the filter criteria
immediately is moving in a safer direction.

> Thanks,
> Rob :)

Thanks,
Taylor

[1]: Optionally "name_s_", if we passed the `--max-pack-size` option to
`git pack-objects`, which we can trigger via a `git repack` option of
the same name.
Christian Couder Feb. 21, 2022, 9:10 p.m. UTC | #6
On Mon, Feb 21, 2022 at 4:11 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Wed, Feb 16, 2022 at 04:07:14PM -0500, John Cai wrote:
> > > I don't know whether that is just around naming (--delete-filter /
> > > --drop-filter /
> > > --expire-filter ?), and/or making the documentation very explicit that
> > > this isn't so
> > > much "omitting certain objects from a packfile" as irretrievably
> > > deleting objects.
> >
> > Yeah, making the name very clear (I kind of like --delete-filter) would certainly help.

I am ok with making the name and doc very clear that it deletes
objects and they might be lost if they haven't been saved elsewhere
first.

> > Also, to have more protection we can either
> >
> > 1. add a config value that needs to be set to true for repack to remove
> > objects (repack.allowDestroyFilter).

I don't think it's of much value. We don't have such config values for
other possibly destructive operations.

> > 2. --filter is dry-run by default and prints out objects that would have been removed,
> > and it has to be combined with another flag --destroy in order for it to actually remove
> > objects from the odb.

I am not sure it's of much value either compared to naming it
--filter-destroy. It's likely to just make things more difficult for
users to understand.

> I share the same concern as Robert and Stolee do. But I think this issue
> goes deeper than just naming.
>
> Even if we called this `git repack --delete-filter` and only ran it with
> `--i-know-what-im-doing` flag, we would still be leaving repository
> corruption on the table, just making it marginally more difficult to
> achieve.

My opinion on this is that the promisor object mechanism assumes by
design that some objects are outside a repo, and that this repo
shouldn't care much about these objects possibly being corrupted.

It's the same for git LFS. As only a pointer file is stored in Git and
the real file is stored elsewhere, the Git repo doesn't care by design
about possible corruption of the real file.

I am not against a name and some docs that strongly state that users
should be very careful when using such a command, but otherwise I
think such a command is perfectly ok. We have other commands that by
design could lead to some objects or data being lost.

> I'm not familiar enough with the proposal to comment authoritatively,
> but it seems like we should be verifying that there is a promisor remote
> which promises any objects that we are about to filter out of the
> repository.

I think it could be a follow up mode that could be useful and safe,
but there should be no requirement for such a mode. In some cases you
know very much what you want and you don't want checks. For example if
you have taken proper care to transfer large objects to another
remote, you might just not need other possibly expensive checks.

[...]

> But as it stands right now, I worry that this feature is too easily
> misused and could result in unintended repository corruption.

Are you worrying about the UI or about what it does?

I am ok with improving the UI, but I think what it does is reasonable.

> I think verifying that that any objects we're about to delete exist
> somewhere should make this safer to use, though even then, I think we're
> still open to a TOCTOU race whereby the promisor has the objects when
> we're about to delete them (convincing Git that deleting those objects
> is OK to do) but gets rid of them after objects have been deleted from
> the local copy (leaving no copies of the object around).

Possible TOCTOU races are a good reason why something with no check is
perhaps a better goal for now.
Taylor Blau Feb. 21, 2022, 9:42 p.m. UTC | #7
On Mon, Feb 21, 2022 at 10:10:15PM +0100, Christian Couder wrote:
> > > Also, to have more protection we can either
> > >
> > > 1. add a config value that needs to be set to true for repack to remove
> > > objects (repack.allowDestroyFilter).
>
> I don't think it's of much value. We don't have such config values for
> other possibly destructive operations.
>
> > > 2. --filter is dry-run by default and prints out objects that would have been removed,
> > > and it has to be combined with another flag --destroy in order for it to actually remove
> > > objects from the odb.
>
> I am not sure it's of much value either compared to naming it
> --filter-destroy. It's likely to just make things more difficult for
> users to understand.

On this and the above, I agree with Christian.

> > I share the same concern as Robert and Stolee do. But I think this issue
> > goes deeper than just naming.
> >
> > Even if we called this `git repack --delete-filter` and only ran it with
> > `--i-know-what-im-doing` flag, we would still be leaving repository
> > corruption on the table, just making it marginally more difficult to
> > achieve.
>
> My opinion on this is that the promisor object mechanism assumes by
> design that some objects are outside a repo, and that this repo
> shouldn't care much about these objects possibly being corrupted.

For what it's worth, I am fine having a mode of repack which allows us
to remove objects that we know are stored by a promisor remote. But this
series doesn't do that, so users could easily run `git repack -d
--filter=...` and find that they have irrecoverably corrupted their
repository.

I think that there are some other reasonable directions, though. One
which Robert and I discussed was making it possible to split a
repository into two packs, one which holds objects that match some
`--filter` criteria, and one which holds the objects that don't match
that filter.

Another option would be to prune the repository according to objects
that are already made available by a promisor remote.

An appealing quality about the above two directions is that the first
doesn't actually remove any objects, just makes it easier to push a
whole pack of unwanted objects off to a promisor remote. The second
prunes the repository according to objects that are already made
available by the promisor remote. (Yes, there is a TOCTOU race there,
too, but it's the same prune-while-pushing race that Git already has
today).

> I am not against a name and some docs that strongly state that users
> should be very careful when using such a command, but otherwise I
> think such a command is perfectly ok. We have other commands that by
> design could lead to some objects or data being lost.

I can think of a handful of ways to remove objects which are unreachable
from a repository, but I am not sure we have any ways to remove objects
which are reachable.

> > But as it stands right now, I worry that this feature is too easily
> > misused and could result in unintended repository corruption.
>
> Are you worrying about the UI or about what it does?
>
> I am ok with improving the UI, but I think what it does is reasonable.

I am more worried about the proposal's functionality than its UI,
hopefully my concerns there are summarized above.

Thanks,
Taylor
Christian Couder Feb. 22, 2022, 5:11 p.m. UTC | #8
On Mon, Feb 21, 2022 at 10:42 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Mon, Feb 21, 2022 at 10:10:15PM +0100, Christian Couder wrote:
> > > > Also, to have more protection we can either
> > > >
> > > > 1. add a config value that needs to be set to true for repack to remove
> > > > objects (repack.allowDestroyFilter).
> >
> > I don't think it's of much value. We don't have such config values for
> > other possibly destructive operations.
> >
> > > > 2. --filter is dry-run by default and prints out objects that would have been removed,
> > > > and it has to be combined with another flag --destroy in order for it to actually remove
> > > > objects from the odb.
> >
> > I am not sure it's of much value either compared to naming it
> > --filter-destroy. It's likely to just make things more difficult for
> > users to understand.
>
> On this and the above, I agree with Christian.
>
> > > I share the same concern as Robert and Stolee do. But I think this issue
> > > goes deeper than just naming.
> > >
> > > Even if we called this `git repack --delete-filter` and only ran it with
> > > `--i-know-what-im-doing` flag, we would still be leaving repository
> > > corruption on the table, just making it marginally more difficult to
> > > achieve.
> >
> > My opinion on this is that the promisor object mechanism assumes by
> > design that some objects are outside a repo, and that this repo
> > shouldn't care much about these objects possibly being corrupted.
>
> For what it's worth, I am fine having a mode of repack which allows us
> to remove objects that we know are stored by a promisor remote. But this
> series doesn't do that, so users could easily run `git repack -d
> --filter=...` and find that they have irrecoverably corrupted their
> repository.

In some cases we just know the objects we are removing are stored by a
promisor remote or are replicated on different physical machines or
both, so you should be fine with this.

If you are not fine with this because sometimes a user might use it
without knowing, then why are you ok with commands deleting refs not
checking that there isn't a regular repack removing dangling objects?

Also note that people who want to remove objects using a filter can
already do it by cloning with a filter and then replacing the original
packs with the packs from the clone. So refusing this new feature is
just making things more cumbersome.
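
For illustration, that existing workaround looks roughly like this
(repo names are placeholders, and the source has to allow the filter,
e.g. via uploadpack.allowFilter):

    git -C src.git config uploadpack.allowfilter true
    git clone --bare --no-local --filter=blob:limit=1m src.git filtered.git
    # then replace src.git's packs with filtered.git's packs; anything
    # that was only in the old packs is gone unless saved elsewhere first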

> I think that there are some other reasonable directions, though. One
> which Robert and I discussed was making it possible to split a
> repository into two packs, one which holds objects that match some
> `--filter` criteria, and one which holds the objects that don't match
> that filter.

I am ok with someone implementing this feature, but if an option that
actually deletes the filtered objects is rejected then such a feature
will be used with some people just deleting one of the resulting packs
(and they might get it wrong), so I don't think any real safety will
be gained.

> Another option would be to prune the repository according to objects
> that are already made available by a promisor remote.

If the objects have just been properly transferred to the promisor
remote, the check will just waste resources.

> An appealing quality about the above two directions is that the first
> doesn't actually remove any objects, just makes it easier to push a
> whole pack of unwanted objects off to a promsior remote. The second
> prunes the repository according to objects that are already made
> available by the promisor remote. (Yes, there is a TOCTOU race there,
> too, but it's the same prune-while-pushing race that Git already has
> today).
>
> > I am not against a name and some docs that strongly state that users
> > should be very careful when using such a command, but otherwise I
> > think such a command is perfectly ok. We have other commands that by
> > design could lead to some objects or data being lost.
>
> I can think of a handful of ways to remove objects which are unreachable
> from a repository, but I am not sure we have any ways to remove objects
> which are reachable.

Cloning with a filter already does that. It's by design in the
promisor object and partial clone mechanisms that reachable objects
are removed. Having more than one promisor remote, which is an
existing mechanism, means that it's just wasteful to require all the
remotes to have all the reachable objects, so how could people easily
set up such remotes? Why make it unnecessarily hard and forbid a
straightforward way?
Taylor Blau Feb. 22, 2022, 5:33 p.m. UTC | #9
Hi Christian,

I think my objections may be based on a misunderstanding of John and
your original proposal. From reading [1], it seemed to me like a
required step of this proposal was to upload the objects you want to
filter out ahead of time, and then run `git repack -ad --filter=...`.

So my concerns thus far have been around the lack of cohesion between
(1) the filter which describes the set of objects uploaded to the HTTP
server, and (2) the filter used when re-filtering the repository.

If (1) and (2) aren't inverses of each other, then in the case where (2)
drops an object which wasn't caught by (1), we have lost that
object.

If instead the server used in your script at [1] is a stand-in for an
ordinary Git remote, that changes my thinking significantly. See below
for more details:

On Tue, Feb 22, 2022 at 06:11:11PM +0100, Christian Couder wrote:
> > > > I share the same concern as Robert and Stolee do. But I think this issue
> > > > goes deeper than just naming.
> > > >
> > > > Even if we called this `git repack --delete-filter` and only ran it with
> > > > `--i-know-what-im-doing` flag, we would still be leaving repository
> > > > corruption on the table, just making it marginally more difficult to
> > > > achieve.
> > >
> > > My opinion on this is that the promisor object mechanism assumes by
> > > design that some objects are outside a repo, and that this repo
> > > shouldn't care much about these objects possibly being corrupted.
> >
> > For what it's worth, I am fine having a mode of repack which allows us
> > to remove objects that we know are stored by a promisor remote. But this
> > series doesn't do that, so users could easily run `git repack -d
> > --filter=...` and find that they have irrecoverably corrupted their
> > repository.
>
> In some cases we just know the objects we are removing are stored by a
> promisor remote or are replicated on different physical machines or
> both, so you should be fine with this.

I am definitely OK with a convenient way to re-filter your repository
locally so long as you know that the objects you are filtering out are
available via some promisor remote.

But perhaps I have misunderstood what this proposal is for. Reading
through John's original cover letter and the link to your demo script, I
understood that a key part of this was being able to upload the pack of
objects you were about to filter out of your local copy to some server
(not necessarily Git) over HTTP.

My hesitation so far has been based on that understanding. Reading these
patches, I don't see a mechanism to upload objects we're about to
expunge to a promisor remote.

But perhaps I'm misunderstanding: if you are instead assuming that the
existing set of remotes can serve any objects that we deleted, and this
is the way to delete them, then I am OK with that approach. But I think
either way, I am missing some details in the original proposal that
would have perhaps made it easier for me to understand what your goals
are.

In any case, this patch series doesn't seem to correctly set up a
promisor remote for me, since doing the following on a fresh clone of
git.git (after running "make"):

    $ bin-wrappers/git repack -ad --filter=blob:none
    $ bin-wrappers/git fsck

results in many "missing blob" and "missing link" lines out of output.

(FWIW, I think what's missing here is correctly setting up the affected
remote(s) as promisors and indicating that we're now in a filtered
setting when going from a full clone down to a partial one.)
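
For reference, a partial clone normally records at least something like
the following, which is what a refiltering `repack --filter` would
presumably need to set up as well (exact keys may vary by version):

    git config remote.origin.promisor true
    git config remote.origin.partialclonefilter blob:none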

> If you are not fine with this because sometimes a user might use it
> without knowing, then why are you ok with commands deleting refs not
> checking that there isn't a regular repack removing dangling objects?

I'm not sure I totally understand your question, but my general sense
has been "because we typically make it difficult / impossible to remove
reachable objects".

Thanks,
Taylor

[1]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52
John Cai Feb. 22, 2022, 6:52 p.m. UTC | #10
Hi Taylor,

On 21 Feb 2022, at 16:42, Taylor Blau wrote:

> On Mon, Feb 21, 2022 at 10:10:15PM +0100, Christian Couder wrote:
>>>> Also, to have more protection we can either
>>>>
>>>> 1. add a config value that needs to be set to true for repack to remove
>>>> objects (repack.allowDestroyFilter).
>>
>> I don't think it's of much value. We don't have such config values for
>> other possibly destructive operations.
>>
>>>> 2. --filter is dry-run by default and prints out objects that would have been removed,
>>>> and it has to be combined with another flag --destroy in order for it to actually remove
>>>> objects from the odb.
>>
>> I am not sure it's of much value either compared to naming it
>> --filter-destroy. It's likely to just make things more difficult for
>> users to understand.
>
> On this and the above, I agree with Christian.
>
>>> I share the same concern as Robert and Stolee do. But I think this issue
>>> goes deeper than just naming.
>>>
>>> Even if we called this `git repack --delete-filter` and only ran it with
>>> `--i-know-what-im-doing` flag, we would still be leaving repository
>>> corruption on the table, just making it marginally more difficult to
>>> achieve.
>>
>> My opinion on this is that the promisor object mechanism assumes by
>> design that some objects are outside a repo, and that this repo
>> shouldn't care much about these objects possibly being corrupted.
>
> For what it's worth, I am fine having a mode of repack which allows us
> to remove objects that we know are stored by a promisor remote. But this
> series doesn't do that, so users could easily run `git repack -d
> --filter=...` and find that they have irrecoverably corrupted their
> repository.
>
> I think that there are some other reasonable directions, though. One
> which Robert and I discussed was making it possible to split a
> repository into two packs, one which holds objects that match some
> `--filter` criteria, and one which holds the objects that don't match
> that filter.
>
> Another option would be to prune the repository according to objects
> that are already made available by a promisor remote.

Thanks for the discussion around the two packfile idea. Definitely interesting.
However, I'm leaning towards the second option here where we ensure that objects that
are about to be deleted can be retrieved via a promisor remote. That way we have an
easy path to recovery.

>
> An appealing quality about the above two directions is that the first
> doesn't actually remove any objects, just makes it easier to push a
> whole pack of unwanted objects off to a promisor remote. The second
> prunes the repository according to objects that are already made
> available by the promisor remote. (Yes, there is a TOCTOU race there,
> too, but it's the same prune-while-pushing race that Git already has
> today).
>
>> I am not against a name and some docs that strongly state that users
>> should be very careful when using such a command, but otherwise I
>> think such a command is perfectly ok. We have other commands that by
>> design could lead to some objects or data being lost.
>
> I can think of a handful of ways to remove objects which are unreachable
> from a repository, but I am not sure we have any ways to remove objects
> which are reachable.
>
>>> But as it stands right now, I worry that this feature is too easily
>>> misused and could result in unintended repository corruption.
>>
>> Are you worrying about the UI or about what it does?
>>
>> I am ok with improving the UI, but I think what it does is reasonable.
>
> I am more worried about the proposal's functionality than its UI,
> hopefully my concerns there are summarized above.
>
> Thanks,
> Taylor
Taylor Blau Feb. 22, 2022, 7:35 p.m. UTC | #11
On Tue, Feb 22, 2022 at 01:52:09PM -0500, John Cai wrote:
> > Another option would be to prune the repository according to objects
> > that are already made available by a promisor remote.
>
> Thanks for the discussion around the two packfile idea. Definitely
> interesting.  However, I'm leaning towards the second option here
> where we ensure that objects that are about to be deleted can be
> retrieved via a promisor remote. That way we have an easy path to
> recovery.

Yeah, I think this may have all come from a potential misunderstanding I
had with the original proposal. More of the details there can be found
in [1].

But assuming that this proposal isn't about first offloading some
objects to an auxiliary (non-Git) server, then I think refiltering into
a single pack makes sense, because we trust the remote to still have
any objects we deleted.

(The snag I hit was that it seemed like your+Christian's proposal hinged
on using _two_ filters, one to produce the set of objects you wanted to
get rid of, and another to produce the set of objects you wanted to
keep. The lack of cohesion between the two is what gave me pause, but it
may not have been what either of you were thinking in the first place).

Anyway, I'm not sure "splitting" a repository along a `--filter` into two
packs is all that interesting of an idea, but it is something we could
do if it became useful to you without writing too much new code (and
instead leveraging `git pack-objects --stdin-packs`).

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
Robert Coup Feb. 23, 2022, 3:40 p.m. UTC | #12
Hi Christian,

On Tue, 22 Feb 2022 at 17:11, Christian Couder
<christian.couder@gmail.com> wrote:
>
> In some cases we just know the objects we are removing are stored by a
> promisor remote or are replicated on different physical machines or
> both, so you should be fine with this.

From my point of view I think the goal here is great.

> > Another option would be to prune the repository according to objects
> > that are already made available by a promisor remote.
>
> If the objects have just been properly transferred to the promisor
> remote, the check will just waste resources.

As far as I can see this patch doesn't know or check that any of the
filtered-out objects are held anywhere else... it simply applies a
filter during repacking and the excluded objects are dropped. That's
the aspect I have concerns about.

Maybe an approach where you build/get/maintain a list of
objects-I-definitely-have-elsewhere and pass it as an exclude list to
repack would be a cleaner/safer/easier solution? If you're confident
enough you don't need to check with the promisor remote then you can
use a local list, or even something generated with `rev-list
--filter=`.
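
For example, something along these lines (the filter value is only an
example):

    git rev-list --objects --all --filter=blob:limit=1m \
        --filter-print-omitted | sed -n 's/^~//p' >have-elsewhere.txt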

Thanks,

Rob :)
Junio C Hamano Feb. 23, 2022, 7:31 p.m. UTC | #13
Christian Couder <christian.couder@gmail.com> writes:

>> For what it's worth, I am fine having a mode of repack which allows us
>> to remove objects that we know are stored by a promisor remote. But this
>> series doesn't do that, so users could easily run `git repack -d
>> --filter=...` and find that they have irrecoverably corrupted their
>> repository.
>
> In some cases we just know the objects we are removing are stored by a
> promisor remote or are replicated on different physical machines or
> both, so you should be fine with this.

So, we need to decide if an object we have that is outside the
narrowed filter definition was (and still is, but let's keep the
assumption the whole lazy clone mechanism makes: promisor remotes
will never shed objects that they once served) available at the
promisor remote, but I suspect we have too little information to
reliably do so.  It is OK to assume that objects in existing packs
taken from the promisor remotes and everything reachable from them
(but missing from our object store) will be available to us from
there.  But if we see an object that is outside of _new_ filter spec
(e.g. you fetched with "max 100MB", now you are refiltering with
"max 50MB", narrowing the spec, and you need to decide for an object
that weighs 70MB), can we tell if that can be retrieved from the
promisor or is it unique to our repository until we push it out?  I
am not sure.  For that matter, do we even have a way to compare if
the new filter spec is a subset, a superset, or neither, of the
original filter spec?

> If you are not fine with this because sometimes a user might use it
> without knowing, then why are you ok with commands deleting refs not
> checking that there isn't a regular repack removing dangling objects?

Sorry, I do not follow this argument.  Your user may do "branch -D"
because the branch deleted is no longer needed, which may mean that
objects only reachable from the deleted branch are no longer needed.
I do not see what repack has anything to do with that.  As long as
the filter spec does not change (in other words, before this series
is applied), the repack that discards objects that are known to be
reachable from objects in packs retrieved from promisor remote, the
objects that are no longer reachable may be removed and that will
not lose objects that we do not know to be retrievable from there
(which is different from objects that we know are unretrievable).
But with filter spec changing after the fact, I am not sure if that
is safe.  IOW, "commands deleting refs" may have been OK without
this series, but this series may be what makes it not OK, no?

Puzzled.
John Cai Feb. 26, 2022, 4:01 p.m. UTC | #14
Thank you for everyone's feedback. Really appreciate the collaboration!

On 23 Feb 2022, at 14:31, Junio C Hamano wrote:

> Christian Couder <christian.couder@gmail.com> writes:
>
>>> For what it's worth, I am fine having a mode of repack which allows us
>>> to remove objects that we know are stored by a promisor remote. But this
>>> series doesn't do that, so users could easily run `git repack -d
>>> --filter=...` and find that they have irrecoverably corrupted their
>>> repository.
>>
>> In some cases we just know the objects we are removing are stored by a
>> promisor remote or are replicated on different physical machines or
>> both, so you should be fine with this.
>
> So, we need to decide if an object we have that is outside the
> narrowed filter definition was (and still is, but let's keep the
> assumption the whole lazy clone mechanism makes: promisor remotes
> will never shed objects that they once served) available at the
> promisor remote, but I suspect we have too little information to
> reliably do so.  It is OK to assume that objects in existing packs
> taken from the promisor remotes and everything reachable from them
> (but missing from our object store) will be available to us from
> there.  But if we see an object that is outside of _new_ filter spec
> (e.g. you fetched with "max 100MB", now you are refiltering with
> "max 50MB", narrowing the spec, and you need to decide for an object
> that weigh 70MB), can we tell if that can be retrieved from the
> promisor or is it unique to our repository until we push it out?  I
> am not sure.  For that matter, do we even have a way to compare if
> the new filter spec is a subset, a superset, or neither, of the
> original filter spec?

let me try to summarize (perhaps oversimplify) the main concern folks have
with this feature, so please correct me if I'm wrong!

As a user, if I apply a filter that ends up deleting objects that it turns
out do not exist anywhere else, then I have irrecoverably corrupted my
repository.

Before git allows me to delete objects from my repository, it should be pretty
certain that I have path to recover those objects if I need to.

Is that correct? It seems to me that, put another way, we don't want to give
users too much rope to hang themselves.

I can see why we would want to do this. In this case, there have been a couple
of alternative ideas proposed throughout this thread that I think are viable and
I wanted to get folks thoughts.

1. split pack file - (Rob gave this idea and Taylor provided some more detail on
   how using pack-objects would make it fairly straightforward to implement)

when a user wants to apply a filter that removes objects from their repository,
split the packfile into one containing objects that are filtered out, and
another packfile with objects that remain.

pros: simple to implement
cons: does not address the question "how sure am I that the objects I want to
filter out of my repository exist on a promisor remote?"

2. check the promisor remotes to see if they contain the objects that are about
   to get deleted. Only delete objects that we find on promisor remotes.

pros: provides assurance that I have access to objects I am about to delete from
a promisor remote.
cons: more complex to implement. [*]

Out of these two, I like 2 more for the aforementioned pros.

* I am beginning to look into how fetches work and am still pretty new to the
codebase so I don't know if this is even feasible, but I was thinking perhaps
we could write a function that fetches with a --filter and create a promisor
packfile containing promisor objects (this operaiton would have to somehow
ignore the presence of the actual objects in the repository). Then, we would
have a record of objects we have access to. Then, repack --filter can remove
only the objects contained in this promisor packfile.

>
>> If you are not fine with this because sometimes a user might use it
>> without knowing, then why are you ok with commands deleting refs not
>> checking that there isn't a regular repack removing dangling objects?
>
> Sorry, I do not follow this argument.  Your user may do "branch -D"
> because the branch deleted is no longer needed, which may mean that
> objects only reachable from the deleted branch are no longer needed.
> I do not see what repack has anything to do with that.  As long as
> the filter spec does not change (in other words, before this series
> is applied), the repack that discards objects that are known to be
> reachable from objects in packs retrieved from promisor remote, the
> objects that are no longer reachable may be removed and that will
> not lose objects that we do not know to be retrievable from there
> (which is different from objects that we know are unretrievable).
> But with filter spec changing after the fact, I am not sure if that
> is safe.  IOW, "commands deleting refs" may have been OK without
> this series, but this series may be what makes it not OK, no?
>
> Puzzled.
Taylor Blau Feb. 26, 2022, 5:29 p.m. UTC | #15
On Sat, Feb 26, 2022 at 11:01:46AM -0500, John Cai wrote:
> let me try to summarize (perhaps over simplify) the main concern folks
> have with this feature, so please correct me if I'm wrong!
>
> As a user, if I apply a filter that ends up deleting objects that it
> turns out do not exist anywhere else, then I have irrecoverably
> corrupted my repository.
>
> Before git allows me to delete objects from my repository, it should
> be pretty certain that I have path to recover those objects if I need
> to.
>
> Is that correct? It seems to me that, put another way, we don't want
> to give users too much rope to hang themselves.

I wrote about my concerns in some more detail in [1], but the thing I
was most unclear on was how your demo script[2] was supposed to work.

Namely, I wasn't sure if you had intended to use two separate filters to
"re-filter" a repository, one to filter objects to be uploaded to a
content store, and another to filter objects to be expunged from the
repository. I have major concerns with that approach, namely that if
each of the filters is not exactly the inverse of the other, then we
will either upload too few objects, or delete too many.

My other concern was around what guarantees we currently provide for a
promisor remote. My understanding is that we expect an object which was
received from the promisor remote to always be fetch-able later on. If
that's the case, then I don't mind the idea of refiltering a repository,
provided that you only need to specify a filter once.

So the suggestion about splitting a repository into two packs was a
potential way to mediate the "two filter" problem, since the two packs
you get exactly correspond to the set of objects that match the filter,
and the set of objects that _don't_ match the filter.

In either case, I tried to use the patches in [1] and was able to
corrupt my local repository (even when fetching from a remote that held
onto the objects I had pruned locally).

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
[2]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52
John Cai Feb. 26, 2022, 8:19 p.m. UTC | #16
Hi Taylor,

(resending this because my email client misbehaved and set the mime type to html)

On 26 Feb 2022, at 12:29, Taylor Blau wrote:

> On Sat, Feb 26, 2022 at 11:01:46AM -0500, John Cai wrote:
>> let me try to summarize (perhaps over simplify) the main concern folks
>> have with this feature, so please correct me if I'm wrong!
>>
>> As a user, if I apply a filter that ends up deleting objects that it
>> turns out do not exist anywhere else, then I have irrecoverably
>> corrupted my repository.
>>
>> Before git allows me to delete objects from my repository, it should
>> be pretty certain that I have path to recover those objects if I need
>> to.
>>
>> Is that correct? It seems to me that, put another way, we don't want
>> to give users too much rope to hang themselves.
>
> I wrote about my concerns in some more detail in [1], but the thing I
> was most unclear on was how your demo script[2] was supposed to work.
>
> Namely, I wasn't sure if you had intended to use two separate filters to
> "re-filter" a repository, one to filter objects to be uploaded to a
> content store, and another to filter objects to be expunged from the
> repository. I have major concerns with that approach, namely that if
> each of the filters is not exactly the inverse of the other, then we
> will either upload too few objects, or delete too many.

Thanks for bringing this up again. I meant to write back regarding what you raised
in the other part of this thread. I think this is a valid concern. To attain the
goal of offloading certain blobs onto another server(B) and saving space on a git
server(A), then there will essentially be two steps. One to upload objects to (B),
and one to remove objects from (A). As you said, these two need to be the inverse of each
other or else you might end up with missing objects.

Thinking about it more, there is also an issue of timing. Even if the filters
are exact inverses of each other, let's say we have the following order of
events:

- (A)'s large blobs get uploaded to (B)
- large blob (C) gets added to (A)
- (A) gets repacked with a filter

In this case we could lose (C) forever. So it does seem like we need some built-in guarantee
that we only shed objects from the repo if we know we can retrieve them later on.

>
> My other concern was around what guarantees we currently provide for a
> promisor remote. My understanding is that we expect an object which was
> received from the promisor remote to always be fetch-able later on. If
> that's the case, then I don't mind the idea of refiltering a repository,
> provided that you only need to specify a filter once.

Could you clarify what you mean by re-filtering a repository? I assumed it
meant specifying a filter, e.g. 100MB, and then narrowing it by specifying a
50MB filter.
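
For example, with the option proposed in this series, I read it as something
like the following (the sizes come from the example above):

    $ git repack -ad --filter=blob:limit=100m   # the filter applied first
    $ git repack -ad --filter=blob:limit=50m    # later narrowed to 50MB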
>
> So the suggestion about splitting a repository into two packs was a
> potential way to mediate the "two filter" problem, since the two packs
> you get exactly correspond to the set of objects that match the filter,
> and the set of objects that _don't_ match the filter.
>
> In either case, I tried to use the patches in [1] and was able to
> corrupt my local repository (even when fetching from a remote that held
> onto the objects I had pruned locally).
>
> Thanks,
> Taylor

Thanks!
John

>
> [1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
> [2]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52
Taylor Blau Feb. 26, 2022, 8:30 p.m. UTC | #17
On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
> Thanks for bringing this up again. I meant to write back regarding what you raised
> in the other part of this thread. I think this is a valid concern. To attain the
> goal of offloading certain blobs onto another server (B) and saving space on a git
> server (A), there will essentially be two steps: one to upload objects to (B),
> and one to remove objects from (A). As you said, these two need to be exact
> inverses of each other, or else you might end up with missing objects.

Do you mean that you want to offload objects both from a local clone of
some repository, _and_ the original remote it was cloned from?

I don't understand what the role of "another server" is here. If this
proposal was about making it easy to remove objects from a local copy of
a repository based on a filter, provided that there was a Git server
elsewhere that could act as a promisor remote, then that makes sense to
me.

But I think I'm not quite understanding the rest of what you're
suggesting.

> > My other concern was around what guarantees we currently provide for a
> > promisor remote. My understanding is that we expect an object which was
> > received from the promisor remote to always be fetch-able later on. If
> > that's the case, then I don't mind the idea of refiltering a repository,
> > provided that you only need to specify a filter once.
>
> Could you clarify what you mean by re-filtering a repository? I assumed it
> meant specifying a filter, e.g. 100MB, and then narrowing it by specifying a
> 50MB filter.

I meant: applying a filter to a local clone (either where there wasn't a
filter before, or a filter which matched more objects) and then removing
objects that don't match the filter.

But your response makes me think of another potential issue. What
happens if I do the following:

    $ git repack -ad --filter=blob:limit=100k
    $ git repack -ad --filter=blob:limit=200k

What should the second invocation do? I would expect that it needs to do
a fetch from the promisor remote to recover any blobs between (100, 200]
KB in size, since they would be gone after the first repack.

This is a problem not just with two consecutive `git repack --filter`s,
I think, since you could cook up the same situation with:

    $ git clone --filter=blob:limit=100k git@github.com:git
    $ git -C git repack -ad --filter=blob:limit=200k

I don't think the existing patches handle this situation, so I'm curious
whether it's something you have considered or not before.
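
In case it is useful, here is one way to check what the first repack in either
of those scenarios leaves behind as promised-but-missing objects (a sketch; it
assumes the repository is a partial clone, i.e. extensions.partialClone is
set):

    # list objects that are reachable but absent from the local object
    # store; rev-list prints them prefixed with '?'
    $ git rev-list --objects --all --missing=print | sed -n 's/^?//p'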

(Unrelated to the above, but please feel free to trim any quoted parts
of emails when responding if they get overly long.)

Thanks,
Taylor
John Cai Feb. 26, 2022, 9:05 p.m. UTC | #18
Hi Taylor,

On 26 Feb 2022, at 15:30, Taylor Blau wrote:

> On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
>> Thanks for bringing this up again. I meant to write back regarding what you raised
>> in the other part of this thread. I think this is a valid concern. To attain the
>> goal of offloading certain blobs onto another server (B) and saving space on a git
>> server (A), there will essentially be two steps: one to upload objects to (B),
>> and one to remove objects from (A). As you said, these two need to be exact
>> inverses of each other, or else you might end up with missing objects.
>
> Do you mean that you want to offload objects both from a local clone of
> some repository, _and_ the original remote it was cloned from?

Yes, exactly. The "another server" would be something like an http server, OR another remote
which hosts a subset of the objects (let's say the large blobs).
>
> I don't understand what the role of "another server" is here. If this
> proposal was about making it easy to remove objects from a local copy of
> a repository based on a filter, provided that there was a Git server
> elsewhere that could act as a promisor remote, then that makes sense to
> me.
>
> But I think I'm not quite understanding the rest of what you're
> suggesting.

Sorry for the lack of clarity here. The goal is to make it easy for a remote to offload a subset
of its objects to __another__ remote (either a Git server or an http server through a remote helper).
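
In config terms, the idea is that (A)'s repository would end up with the other
remote configured as an additional promisor remote, something like the sketch
below. The remote name, URL and filter are illustrative, and this relies on
Git's existing support for multiple promisor remotes:

    $ git remote add blobstore https://example.com/large-blobs.git
    $ git config remote.blobstore.promisor true
    $ git config remote.blobstore.partialCloneFilter blob:limit=100k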
>
>>> My other concern was around what guarantees we currently provide for a
>>> promisor remote. My understanding is that we expect an object which was
>>> received from the promisor remote to always be fetch-able later on. If
>>> that's the case, then I don't mind the idea of refiltering a repository,
>>> provided that you only need to specify a filter once.
>>
>> Could you clarify what you mean by re-filtering a repository? I assumed it
>> meant specifying a filter, e.g. 100MB, and then narrowing it by specifying a
>> 50MB filter.
>
> I meant: applying a filter to a local clone (either where there wasn't a
> filter before, or a filter which matched more objects) and then removing
> objects that don't match the filter.
>
> But your response makes me think of another potential issue. What
> happens if I do the following:
>
>     $ git repack -ad --filter=blob:limit=100k
>     $ git repack -ad --filter=blob:limit=200k
>
> What should the second invocation do? I would expect that it needs to do
> a fetch from the promisor remote to recover any blobs between (100, 200]
> KB in size, since they would be gone after the first repack.
>
> This is a problem not just with two consecutive `git repack --filter`s,
> I think, since you could cook up the same situation with:
>
>     $ git clone --filter=blob:limit=100k git@github.com:git
>     $ git -C git repack -ad --filter=blob:limit=200k
>
> I don't think the existing patches handle this situation, so I'm curious
> whether it's something you have considered or not before.

I have not - I will have to think through this case, but this sounds similar to
what [1] is about.

>
> (Unrelated to the above, but please feel free to trim any quoted parts
> of emails when responding if they get overly long.)
>
> Thanks,
> Taylor

Thanks
John

1. https://lore.kernel.org/git/pull.1138.v2.git.1645719218.gitgitgadget@gmail.com/
Taylor Blau Feb. 26, 2022, 9:44 p.m. UTC | #19
On Sat, Feb 26, 2022 at 04:05:37PM -0500, John Cai wrote:
> Hi Taylor,
>
> On 26 Feb 2022, at 15:30, Taylor Blau wrote:
>
> > On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
> >> Thanks for bringing this up again. I meant to write back regarding what you raised
> >> in the other part of this thread. I think this is a valid concern. To attain the
> >> goal of offloading certain blobs onto another server (B) and saving space on a git
> >> server (A), there will essentially be two steps: one to upload objects to (B),
> >> and one to remove objects from (A). As you said, these two need to be exact
> >> inverses of each other, or else you might end up with missing objects.
> >
> > Do you mean that you want to offload objects both from a local clone of
> > some repository, _and_ the original remote it was cloned from?
>
> yes, exactly. The "another server" would be something like an http server, OR another remote
> which hosts a subset of the objects (let's say the large blobs).
> >
> > I don't understand what the role of "another server" is here. If this
> > proposal was about making it easy to remove objects from a local copy of
> > a repository based on a filter provided that there was a Git server
> > elsewhere that could act as a promisor remote, than that makes sense to
> > me.
> >
> > But I think I'm not quite understanding the rest of what you're
> > suggesting.
>
> Sorry for the lack of clarity here. The goal is to make it easy for a remote to offload a subset
> of its objects to __another__ remote (either a Git server or an http server through a remote helper).

Does the other server then act as a promisor remote in conjunction with
the Git server? I'm having trouble understanding why the _Git_ remote
you originally cloned from needs to offload its objects, too.

So I think the list would benefit from understanding some more of the
details and motivation there. But it would also benefit us to have some
understanding of how we'll ensure that any objects which are moved out
of a Git repository make their way to another server.

I am curious to hear Jonathan Tan's thoughts on this all, too.

Thanks,
Taylor