
[v2,5/8] repack: add `--filter=<filter-spec>` option

Message ID 20230705060812.2865188-6-christian.couder@gmail.com (mailing list archive)
State Superseded
Series Repack objects into separate packfiles based on a filter

Commit Message

Christian Couder July 5, 2023, 6:08 a.m. UTC
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.

This could be useful if, for example, some large blobs take a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.

In other use cases it might make sense to put all the blobs into
separate storage.

This is done by running two `git pack-objects` commands. The first one
is run with `--filter=<filter-spec>`, using the specified filter. It
packs objects while omitting the objects specified by the filter.
Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/git-repack.txt |  9 +++++
 builtin/repack.c             | 67 ++++++++++++++++++++++++++++++++++++
 t/t7700-repack.sh            | 16 +++++++++
 3 files changed, 92 insertions(+)

Comments

Junio C Hamano July 5, 2023, 5:53 p.m. UTC | #1
Christian Couder <christian.couder@gmail.com> writes:

> This could be useful if, for example, some large blobs take a lot of
> precious space on fast storage while they are rarely accessed. It could
> make sense to move them into a separate cheaper, though slower, storage.
>
> In other use cases it might make sense to put all the blobs into
> separate storage.

Minor nit.  Aren't the above two the same use case?

> This is done by running two `git pack-objects` commands. The first one
> is run with `--filter=<filter-spec>`, using the specified filter. It
> packs objects while omitting the objects specified by the filter.
> Then another `git pack-objects` command is launched using
> `--stdin-packs`. We pass it all the previously existing packs into its
> stdin, so that it will pack all the objects in the previously existing
> packs. But we also pass into its stdin, the pack created by the previous
> `git pack-objects --filter=<filter-spec>` command as well as the kept
> packs, all prefixed with '^', so that the objects in these packs will be
> omitted from the resulting pack.

When I started reading the paragraph, the first question that came
to my mind was if these two pack-objects processes can and should be
run in parallel, which is answered in the part near the end of the
paragraph.  It may be a good idea to start the paragraph with "by
running `git pack-objects` command twice in a row" or something to
make it clear that one should not (and cannot) be run before the
other completes.

In fact, isn't the call site of write_filtered_pack() in this patch
a bit too early?  The subprocess that runs with "--stdin-packs" is
started and told about the names of the pack we are going to create,
and it does not start processing until it reads everything (i.e. we
run fclose(in) in the write_filtered_pack() function), but the loop
over "names" string list in the caller that moves the tempfiles to
their final filenames comes after the call to close_object_store()
we see in the post context of the call to write_filtered_pack() that
is new in this patch.

The "--stdin-packs" one is told to exclude objects that appear in
these packs, so if the main process is a bit slow to finalize the
packfiles it created (and told the "--stdin-packs" process about),
it will not lead to repository corruption---just some objects are
included in the packfiles "--stdin-packs" one creates even though
they do not have to be.  So it does not sound like a huge problem to
me, but still it somehow looks wrong.  Am I misreading the code?

Thanks.
Junio C Hamano July 5, 2023, 6:12 p.m. UTC | #2
Christian Couder <christian.couder@gmail.com> writes:

> +--filter=<filter-spec>::
> +	Remove objects matching the filter specification from the
> +	resulting packfile and put them into a separate packfile. Note
> +	that objects used in the working directory are not filtered
> +	out. So for the split to fully work, it's best to perform it
> +	in a bare repo and to use the `-a` and `-d` options along with
> +	this option.  See linkgit:git-rev-list[1] for valid
> +	`<filter-spec>` forms.

After running the command with this option once, we will have two
packfiles, one with objects that match and the other with objects
that do not match the filter spec.  Then what is the next step for
the user of this feature?  Moving the former to a slower storage
was cited as a motivation for the feature, but can the user tell
which one of these two packfiles is the one that consists of the
filtered out objects?  If there is no mechanism to do so, shouldn't
we have one to make this feature more usable?

At the level of "pack-objects" command, we report the new packfiles
so that the user does not have to take "ls .git/objects/pack" before
and after the operation to compare and learn which ones are new.
I do not think "repack" that is a Porcelain should do such a
reporting on its standard output, but that means either the feature
should probably be done at the plumbing level (i.e. "pack-objects"),
or the marking of the new packfiles needs to be done in a way that
tools can later find them out, e.g. on the filesystem, similar to
the way ".keep" marker tells which ones are not to be repacked, etc.
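The existing `.keep` convention referred to above can be seen in a
throwaway repository (the repository name and identity below are
illustrative):

```shell
set -e
rm -rf kdemo && git init -q kdemo && cd kdemo
echo one >f &&
git add f &&
git -c user.name=t -c user.email=t@example.com commit -qm one
git repack -adq                          # everything in one pack
kept=$(cd .git/objects/pack && ls pack-*.pack)
touch ".git/objects/pack/${kept%.pack}.keep"   # mark that pack as kept

echo two >f &&
git add f &&
git -c user.name=t -c user.email=t@example.com commit -qm two
git repack -adq     # -d removes redundant packs, but not kept ones
ls .git/objects/pack/pack-*.pack
```

Because the first pack carries a `.keep` marker, `repack -a -d` neither
folds its objects into the new pack nor deletes it, so a tool can find
the marked pack again later just by looking at the filesystem.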
Christian Couder July 24, 2023, 9:01 a.m. UTC | #3
On Wed, Jul 5, 2023 at 7:53 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > This could be useful if, for example, some large blobs take a lot of
> > precious space on fast storage while they are rarely accessed. It could
> > make sense to move them into a separate cheaper, though slower, storage.
> >
> > In other use cases it might make sense to put all the blobs into
> > separate storage.
>
> Minor nit.  Aren't the above two the same use case?

In the first case only some large blobs are moved to slower storage
and in the other case all the blobs are moved to slower storage. So
yeah the use cases are very similar. Not sure if and how I can improve
the above wording though.

> > This is done by running two `git pack-objects` commands. The first one
> > is run with `--filter=<filter-spec>`, using the specified filter. It
> > packs objects while omitting the objects specified by the filter.
> > Then another `git pack-objects` command is launched using
> > `--stdin-packs`. We pass it all the previously existing packs into its
> > stdin, so that it will pack all the objects in the previously existing
> > packs. But we also pass into its stdin, the pack created by the previous
> > `git pack-objects --filter=<filter-spec>` command as well as the kept
> > packs, all prefixed with '^', so that the objects in these packs will be
> > omitted from the resulting pack.
>
> When I started reading the paragraph, the first question that came
> to my mind was if these two pack-objects processes can and should be
> run in parallel, which is answered in the part near the end of the
> paragraph.  It may be a good idea to start the paragraph with "by
> running `git pack-objects` command twice in a row" or something to
> make it clear that one should not (and cannot) be run before the
> other completes.

Ok, in version 3 that I just sent, that paragraph starts with:

"
   This is done by running `git pack-objects` twice in a row. The first
   command is run with `--filter=<filter-spec>`, using the specified
   filter.
"

> In fact, isn't the call site of write_filtered_pack() in this patch
> a bit too early?  The subprocess that runs with "--stdin-packs" is
> started and told about the names of the pack we are going to create,
> and it does not start processing until it reads everything (i.e. we
> run fclose(in) in the write_filtered_pack() function), but the loop
> over "names" string list in the caller that moves the tempfiles to
> their final filenames comes after the call to close_object_store()
> we see in the post context of the call to write_filtered_pack() that
> is new in this patch.

I think it can work if the call to write_filtered_pack() is either
before the call to close_object_store() or after it. It would just use
the tempfiles with their temporary name in the first case and with
their final name in the second case.

write_filtered_pack() is very similar to write_cruft_pack() which is
called before the call to close_object_store(), so I prefer to keep it
before that call too, if possible, for consistency.

> The "--stdin-packs" one is told to exclude objects that appear in
> these packs, so if the main process is a bit slow to finalize the
> packfiles it created (and told the "--stdin-packs" process about),
> it will not lead to repository corruption---just some objects are
> included in the packfiles "--stdin-packs" one creates even though
> they do not have to be.  So it does not sound like a huge problem to
> me, but still it somehow looks wrong.  Am I misreading the code?

I would have thought that as finish_pack_objects_cmd() calls
finish_command() the first pack-objects command (which is called
without --stdout) should be completely finished and the packfiles
fully created when write_filtered_pack() (or write_cruft_pack()) is
called, even if the object store is not closed, but you might be
right.

Perhaps this could be dealt with separately though, as I think we
might want to fix write_cruft_pack() first then.
Christian Couder July 24, 2023, 9:02 a.m. UTC | #4
On Wed, Jul 5, 2023 at 8:12 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > +--filter=<filter-spec>::
> > +     Remove objects matching the filter specification from the
> > +     resulting packfile and put them into a separate packfile. Note
> > +     that objects used in the working directory are not filtered
> > +     out. So for the split to fully work, it's best to perform it
> > +     in a bare repo and to use the `-a` and `-d` options along with
> > +     this option.  See linkgit:git-rev-list[1] for valid
> > +     `<filter-spec>` forms.
>
> After running the command with this option once, we will have two
> packfiles, one with objects that match and the other with objects
> that do not match the filter spec.  Then what is the next step for
> the user of this feature?  Moving the former to a slower storage
> was cited as a motivation for the feature, but can the user tell
> which one of these two packfiles is the one that consists of the
> filtered out objects?  If there is no mechanism to do so, shouldn't
> we have one to make this feature more usable?
>
> At the level of "pack-objects" command, we report the new packfiles
> so that the user does not have to take "ls .git/objects/pack" before
> and after the operation to compare and learn which ones are new.
> I do not think "repack" that is a Porcelain should do such a
> reporting on its standard output, but that means either the feature
> should probably be done at the plumbing level (i.e. "pack-objects"),
> or the marking of the new packfiles needs to be done in a way that
> tools can later find them out, e.g. on the filesystem, similar to
> the way ".keep" marker tells which ones are not to be repacked, etc.

I think commands like `git verify-pack -v ...` can already tell a bit
about the content of a packfile.

Also this patch series adds `test-tool find-pack` which can help too.
It could maybe be converted into a new `git verify-pack --find-pack`
option if users want it.

Then, as you later found out, there is the --filter-to=<dir> option
added later by this series.

To clarify this, I have added the following to the commit message in
the version 3 I just sent:

"
   It's possible to find which new packfile contains the filtered out
   objects using one of the following:

     - `git verify-pack -v ...`,
     - `test-tool find-pack ...`, which a previous commit added,
     - `--filter-to=<dir>`, which a following commit will add to specify
       where the pack containing the filtered out objects will be.
"
Junio C Hamano July 24, 2023, 6:28 p.m. UTC | #5
Christian Couder <christian.couder@gmail.com> writes:

>> Minor nit.  Aren't the above two the same use case?
>
> In the first case only some large blobs are moved to slower storage
> and in the other case all the blobs are moved to slower storage. So
> yeah the use cases are very similar. Not sure if and how I can improve
> the above wording though.

Just by removing one or the other, it would be quite improved, no?
Moving away some blobs could move away all or just a selected
subset.

> I think it can work if the call to write_filtered_pack() is either
> before the call to close_object_store() or after it. It would just use
> the tempfiles with their temporary name in the first case and with
> their final name in the second case.
>
> write_filtered_pack() is very similar to write_cruft_pack() which is
> called before the call to close_object_store(), so I prefer to keep it
> before that call too, if possible, for consistency.

As long as the set-up is not racy, either would be OK, as the names
are not recorded in the end result.

If the upstream tells the downstream the temporary's name and then
finalizes the temporary to the final name before the downstream
reacts to the input, however, then by the time downstream starts
working on the file, the file may not exist under its original,
temporary name, and that kind of race was what I was worried about.

> Perhaps this could be dealt with separately though, as I think we
> might want to fix write_cruft_pack() first then.

Sorry, I am not understanding this quite well.  Do you mean we
should add one more known-to-be-racy-and-incorrect codepath because
there is already a codepath that needs to be fixed anyway?

If write_cruft_pack() has a similar issue, then yeah, let's fix that
first (testing might be tricky for any racy bugs).  And let's use
the same technique as used to fix it in this series, too, so that we
do not reintroduce a similar bug due to racy setup.

Thanks.
Christian Couder July 25, 2023, 3:22 p.m. UTC | #6
On Mon, Jul 24, 2023 at 8:28 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> >> Minor nit.  Aren't the above two the same use case?
> >
> > In the first case only some large blobs are moved to slower storage
> > and in the other case all the blobs are moved to slower storage. So
> > yeah the use cases are very similar. Not sure if and how I can improve
> > the above wording though.
>
> Just by removing one or the other, it would be quite improved, no?
> Moving away some blobs could move away all or just a selected
> subset.

Ok, I have done that in my current version.

> > I think it can work if the call to write_filtered_pack() is either
> > before the call to close_object_store() or after it. It would just use
> > the tempfiles with their temporary name in the first case and with
> > their final name in the second case.
> >
> > write_filtered_pack() is very similar to write_cruft_pack() which is
> > called before the call to close_object_store(), so I prefer to keep it
> > before that call too, if possible, for consistency.
>
> As long as the set-up is not racy, either would be OK, as the names
> are not recorded in the end result.
>
> If the upstream tells the downstream the temporary's name and then
> finalizes the temporary to the final name before the downstream
> reacts to the input,

It doesn't seem to me that it's what happens.

We have the following order:

  - finish_pack_objects_cmd() is called for the first pack-objects
process. It populates the 'names' string_list with the temporary name
of the packfile it generated (which doesn't contain the filtered out
objects) and calls finish_command() to finish the first pack-objects
process. So as far as I understand nothing can be written anymore to
the packfile when finish_pack_objects_cmd() returns.

  - write_filtered_pack() is called. It starts the second pack-objects
process and passes it the temporary name of the packfile that was just
written, taking it from the 'names' string_list. It then calls
finish_pack_objects_cmd() for the second process which populates the
'names' string_list with the temporary name of the packfile created by
the second process and finishes the second process. So nothing can
then be written in the second packfile anymore.

  - close_object_store() is called which renames the packfiles from
the 'names' string_list giving them their final name.

So the final names are given only once both processes are finished and
both packfiles have been fully written.

> however, then by the time downstream starts
> working on the file, the file may not exist under its original,
> temporary name, and that kind of race was what I was worried about.
>
> > Perhaps this could be dealt with separately though, as I think we
> > might want to fix write_cruft_pack() first then.
>
> Sorry, I am not understanding this quite well.  Do you mean we
> should add one more known-to-be-racy-and-incorrect codepath because
> there is already a codepath that needs to be fixed anyway?

No.

> If write_cruft_pack() has a similar issue, then yeah, let's fix that
> first (testing might be tricky for any racy bugs).  And let's use
> the same technique as used to fix it in this series, too, so that we
> do not reintroduce a similar bug due to racy setup.

Yeah, that's what I mean.

I am not sure the race actually exists though. I have tried to explain
why it seems to me that things look correct, but from previous
experience I know that you are very often right, and I might have
missed something.

Thanks.
Junio C Hamano July 25, 2023, 5:25 p.m. UTC | #7
Christian Couder <christian.couder@gmail.com> writes:

> We have the following order:
>
>   - finish_pack_objects_cmd() is called for the first pack-objects
> process. It populates the 'names' string_list with the temporary name
> of the packfile it generated (which doesn't contain the filtered out
> objects) and calls finish_command() to finish the first pack-objects
> process. So as far as I understand nothing can be written anymore to
> the packfile when finish_pack_objects_cmd() returns.
>
>   - write_filtered_pack() is called. It starts the second pack-objects
> process and passes it the temporary name of the packfile that was just
> written, taking it from the 'names' string_list. It then calls
> finish_pack_objects_cmd() for the second process which populates the
> 'names' string_list with the temporary name of the packfile created by
> the second process and finishes the second process. So nothing can
> then be written in the second packfile anymore.
>
>   - close_object_store() is called which renames the packfiles from
> the 'names' string_list giving them their final name.

"which renames" -> "and then we enter a loop to rename" and
close_object_store() itself and its callees do not do much, but yes,
you are right.  As finish_pack_objects_cmd() is synchronous, there
cannot be such a race as I feared (if we were feeding the pack-objects
process that collects the objects that would have been filtered out
with the final packfile paths, and if we were only renaming them to the
final paths after that close_object_store() call, then the process
would want to see the final names that are not there yet, but that's
not a race but a bug that would reliably trigger).

> So the final names are given only once both processes are finished and
> both packfiles have been fully written.

Thanks for walking through the codepaths involved.  We are good
then.
Junio C Hamano July 25, 2023, 11:08 p.m. UTC | #8
Junio C Hamano <gitster@pobox.com> writes:

> Thanks for walking through the codepaths involved.  We are good
> then.

Sorry, but not so fast.

https://github.com/git/git/actions/runs/5661445152 (seen with this topic)
https://github.com/git/git/actions/runs/5662517690 (seen w/o this topic)

The former fails t7700 in the linux-TEST-vars job, while the latter
passes the same job.

Thanks.
Christian Couder Aug. 8, 2023, 8:45 a.m. UTC | #9
On Wed, Jul 26, 2023 at 1:09 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Thanks for walking through the codepaths involved.  We are good
> > then.
>
> Sorry, but not so fast.
>
> https://github.com/git/git/actions/runs/5661445152 (seen with this topic)
> https://github.com/git/git/actions/runs/5662517690 (seen w/o this topic)
>
> The former fails t7700 in the linux-TEST-vars job, while the latter
> passes the same job.

I think this was because I added the following test:

+test_expect_success '--filter fails with --write-bitmap-index' '
+    test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
+        --filter=blob:none &&
+
+    git -C bare.git repack -a -d --no-write-bitmap-index \
+        --filter=blob:none
+'

which fails because in the linux-TEST-vars job the
GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP env variable is set to 1 and
this counteracts the `--write-bitmap-index` option.

I have tried to fix it like this:

+test_expect_success '--filter fails with --write-bitmap-index' '
+    GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 test_must_fail git -C
bare.git repack \
+        -a -d --write-bitmap-index --filter=blob:none
+'

but I haven't been able to check that this works on CI as all the jobs
seem to fail these days before they even start:

https://github.com/chriscool/git/actions/runs/5791544404/job/15696524676

Thanks!
Taylor Blau Aug. 9, 2023, 8:38 p.m. UTC | #10
On Tue, Aug 08, 2023 at 10:45:48AM +0200, Christian Couder wrote:
> On Wed, Jul 26, 2023 at 1:09 AM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Junio C Hamano <gitster@pobox.com> writes:
> >
> > > Thanks for walking through the codepaths involved.  We are good
> > > then.
> >
> > Sorry, but not so fast.
> >
> > https://github.com/git/git/actions/runs/5661445152 (seen with this topic)
> > https://github.com/git/git/actions/runs/5662517690 (seen w/o this topic)
> >
> > The former fails t7700 in the linux-TEST-vars job, while the latter
> > passes the same job.
>
> I think this was because I added the following test:
>
> +test_expect_success '--filter fails with --write-bitmap-index' '
> +    test_must_fail git -C bare.git repack -a -d --write-bitmap-index \
> +        --filter=blob:none &&
> +
> +    git -C bare.git repack -a -d --no-write-bitmap-index \
> +        --filter=blob:none
> +'
>
> which fails because in the linux-TEST-vars job the
> GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP env variable is set to 1 and
> this counteracts the `--write-bitmap-index` option.

Makes sense. That linux-TEST-vars job always seems to get me, too.

(As an aside, and definitely not related to your patch here, I wonder if
we should consider dropping some of the older TEST variables that belong
to features that we no longer consider experimental).

> I have tried to fix it like this:
>
> +test_expect_success '--filter fails with --write-bitmap-index' '
> +    GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 test_must_fail git -C
> bare.git repack \
> +        -a -d --write-bitmap-index --filter=blob:none
> +'
>
> but I haven't been able to check that this works on CI as all the job
> seems to fail these days before they even start:

I think the canonical way to do this is with env, like so:

    test_must_fail env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
      git -C bare.git repack -ad --write-bitmap-index --filter=blob:none 2>err

Thanks,
Taylor
Junio C Hamano Aug. 9, 2023, 10:50 p.m. UTC | #11
Christian Couder <christian.couder@gmail.com> writes:

> but I haven't been able to check that this works on CI as all the jobs
> seem to fail these days before they even start:

This remark made me worried, but (luckily) it does not seem to be
the case for other topics cooking in 'next' or queued in 'seen'.

It does seem that this topic directly queued on 'master' by itself
without any other topics do break the CI quite badly.  Almost
nothing passes:

  https://github.com/git/git/actions/runs/5812873987

but it may be something as silly as failing test lint.  I didn't
check very closely.

Thanks.
Junio C Hamano Aug. 9, 2023, 11:38 p.m. UTC | #12
Junio C Hamano <gitster@pobox.com> writes:

> It does seem that this topic directly queued on 'master' by itself
> without any other topics does break the CI quite badly.  Almost
> nothing passes:
>
>   https://github.com/git/git/actions/runs/5812873987
>
> but it may be something as silly as failing test lint.  I didn't
> check very closely.

And a bit further digging reveals that it is the case.

https://github.com/git/git/actions/runs/5812873987/job/15759211568#step:4:787

Locally you should be able to reproduce it by

    make
    make -C t test-lint

before sending your patches out.

I've queued a squashable fix-up on top of the topic.

Thanks.

---
 t/t7700-repack.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 9b1e189a62..48e92aa6f7 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -342,8 +342,9 @@ test_expect_success 'repacking with a filter works' '
 '
 
 test_expect_success '--filter fails with --write-bitmap-index' '
-	GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 test_must_fail git -C bare.git repack \
-		-a -d --write-bitmap-index --filter=blob:none
+	test_must_fail \
+		env GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
+		git -C bare.git repack -a -d --write-bitmap-index --filter=blob:none
 '
 
 test_expect_success 'repacking with two filters works' '
Jeff King Aug. 10, 2023, 12:10 a.m. UTC | #13
On Wed, Aug 09, 2023 at 04:38:45PM -0700, Junio C Hamano wrote:

> Locally you should be able to reproduce it by
> 
>     make
>     make -C t test-lint
> 
> before sending your patches out.

This shouldn't be necessary. "make test" is supposed to run test-lint
(and does catch this case for me). Provided you haven't overridden that
by setting TEST_LINT to some more limited set.

-Peff

Patch

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 4017157949..d702553033 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,15 @@  depth is 4095.
 	a larger and slower repository; see the discussion in
 	`pack.packSizeLimit`.
 
+--filter=<filter-spec>::
+	Remove objects matching the filter specification from the
+	resulting packfile and put them into a separate packfile. Note
+	that objects used in the working directory are not filtered
+	out. So for the split to fully work, it's best to perform it
+	in a bare repo and to use the `-a` and `-d` options along with
+	this option.  See linkgit:git-rev-list[1] for valid
+	`<filter-spec>` forms.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index 4e5afee8d8..e2661b956c 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -54,6 +54,7 @@  struct pack_objects_args {
 	const char *depth;
 	const char *threads;
 	const char *max_pack_size;
+	const char *filter;
 	int no_reuse_delta;
 	int no_reuse_object;
 	int quiet;
@@ -174,6 +175,8 @@  static void prepare_pack_objects(struct child_process *cmd,
 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
 	if (args->max_pack_size)
 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
+	if (args->filter)
+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
 	if (args->no_reuse_delta)
 		strvec_pushf(&cmd->args, "--no-reuse-delta");
 	if (args->no_reuse_object)
@@ -734,6 +737,57 @@  static int finish_pack_objects_cmd(struct child_process *cmd,
 	return finish_command(cmd);
 }
 
+static int write_filtered_pack(const struct pack_objects_args *args,
+			       const char *destination,
+			       const char *pack_prefix,
+			       struct string_list *names,
+			       struct string_list *existing_packs,
+			       struct string_list *existing_kept_packs)
+{
+	struct child_process cmd = CHILD_PROCESS_INIT;
+	struct string_list_item *item;
+	FILE *in;
+	int ret;
+	const char *scratch;
+	int local = skip_prefix(destination, packdir, &scratch);
+
+	/* We need to copy 'args' to modify it */
+	struct pack_objects_args new_args = *args;
+
+	/* No need to filter again */
+	new_args.filter = NULL;
+
+	prepare_pack_objects(&cmd, &new_args, destination);
+
+	strvec_push(&cmd.args, "--stdin-packs");
+
+	cmd.in = -1;
+
+	ret = start_command(&cmd);
+	if (ret)
+		return ret;
+
+	/*
+	 * names has a confusing double use: it both provides the list
+	 * of just-written new packs, and accepts the name of the
+	 * filtered pack we are writing.
+	 *
+	 * By the time it is read here, it contains only the pack(s)
+	 * that were just written, which is exactly the set of packs we
+	 * want to consider kept.
+	 */
+	in = xfdopen(cmd.in, "w");
+	for_each_string_list_item(item, names)
+		fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
+	for_each_string_list_item(item, existing_packs)
+		fprintf(in, "%s.pack\n", item->string);
+	for_each_string_list_item(item, existing_kept_packs)
+		fprintf(in, "^%s.pack\n", item->string);
+	fclose(in);
+
+	return finish_pack_objects_cmd(&cmd, names, local);
+}
+
 static int write_cruft_pack(const struct pack_objects_args *args,
 			    const char *destination,
 			    const char *pack_prefix,
@@ -866,6 +920,8 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("limits the maximum number of threads")),
 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
 				N_("maximum size of each packfile")),
+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
+				N_("object filtering")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
@@ -1105,6 +1161,17 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (po_args.filter) {
+		ret = write_filtered_pack(&po_args,
+					  packtmp,
+					  find_pack_prefix(),
+					  &names,
+					  &existing_nonkept_packs,
+					  &existing_kept_packs);
+		if (ret)
+			goto cleanup;
+	}
+
 	string_list_sort(&names);
 
 	close_object_store(the_repository->objects);
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index af79266c58..66589e4217 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -293,6 +293,22 @@  test_expect_success 'auto-bitmaps do not complain if unavailable' '
 	test_must_be_empty actual
 '
 
+test_expect_success 'repacking with a filter works' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d --filter=blob:none &&
+	test_stdout_line_count = 2 ls bare.git/objects/pack/*.pack &&
+	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
+	test -n "$commit_pack" &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -n "$blob_pack" &&
+	test "$commit_pack" != "$blob_pack" &&
+	tree_pack=$(test-tool -C bare.git find-pack HEAD^{tree}) &&
+	test "$tree_pack" = "$commit_pack" &&
+	blob_pack2=$(test-tool -C bare.git find-pack HEAD:file2) &&
+	test "$blob_pack2" = "$blob_pack"
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index