
[v2,3/3] upload-pack: allow configuring a missing-action

Message ID 20240515132543.851987-4-christian.couder@gmail.com (mailing list archive)
State Superseded
Series upload-pack: support a missing-action

Commit Message

Christian Couder May 15, 2024, 1:25 p.m. UTC
From: Christian Couder <chriscool@tuxfamily.org>

In case some objects are missing from a server, it might still be
useful to be able to fetch or clone from it if the client already has
the missing objects or can get them in some way.

For example, in case both the server and the client are using a
separate promisor remote that contains some objects, it can be better
if the server doesn't try to send such objects back to the client, but
instead lets the client get those objects separately from the promisor
remote. (The client needs to have the separate promisor remote
configured for that to work.)

Another example could be a server where some objects have been
corrupted or deleted. It could still be useful for clients who could
get those objects from another source, like perhaps a different
client, to be able to fetch or clone from the server.

To configure what a server should do if there are such missing
objects, let's teach `git upload-pack` a new
`uploadpack.missingAction` configuration variable.

This new missing-action works in a similar way as what `git rev-list`
and `git pack-objects` already support with their
`--missing=<missing-action>` command line options. In fact when
`uploadpack.missingAction` is set to something different than "error",
`git upload-pack` will just pass the corresponding
`--missing=<missing-action>` to `git pack-objects`.

So this new missing-action has the same limitations as
`git pack-objects --missing=<missing-action>`. Especially when not
using promisor remotes, 'allow-any' works only for blobs.

Another limitation comes from `git upload-pack` itself which requires
setting `GIT_NO_LAZY_FETCH` to 0 since 7b70e9efb1 (upload-pack:
disable lazy-fetching by default, 2024-04-16) for lazy fetching from
a promisor remote to work on the server side.
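A minimal sketch of the server-side setup this describes (repository
name hypothetical; the valid values mirror
`git pack-objects --missing=<action>`):

```shell
# Stand up a server repository and opt it in to the new missing-action
# (names hypothetical). Valid values, as with
# `git pack-objects --missing=<action>`: error (the default),
# allow-any, allow-promisor.
git init --bare server.git
git -C server.git config uploadpack.missingAction allow-promisor

# Lazy fetching is disabled in `git upload-pack` by default since
# 7b70e9efb1; "allow-promisor" needs it re-enabled in the serving
# environment so the server can fetch from its own promisor remote:
export GIT_NO_LAZY_FETCH=0
```

With this set, `git upload-pack` would pass the corresponding
`--missing=allow-promisor` option down to `git pack-objects`.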

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
---
 Documentation/config/uploadpack.txt |   9 ++
 missing.c                           |  16 ++++
 missing.h                           |   2 +
 t/t5706-upload-pack-missing.sh      | 125 ++++++++++++++++++++++++++++
 upload-pack.c                       |  19 +++++
 5 files changed, 171 insertions(+)
 create mode 100755 t/t5706-upload-pack-missing.sh

Comments

Junio C Hamano May 15, 2024, 5:08 p.m. UTC | #1
Christian Couder <christian.couder@gmail.com> writes:

> From: Christian Couder <chriscool@tuxfamily.org>
>
> In case some objects are missing from a server, it might still be
> useful to be able to fetch or clone from it if the client already has
> the missing objects or can get them in some way.

Be more assertive.  We do not want to add a new feature randomly
only because it _might_ be useful to somebody in a strange and
narrow use case that _might_ exist.

> For example, in case both the server and the client are using a
> separate promisor remote that contains some objects, it can be better
> if the server doesn't try to send such objects back to the client, but
> instead lets the client get those objects separately from the promisor
> remote. (The client needs to have the separate promisor remote
> configured for that to work.)

Is it "it can be better", or is it "it is always better"?  Pick an
example that you can say the latter to make your example more
convincing.

Repository S borrows from its "promisor" X, and repository C which
initially cloned from S borrows from its "promisor" S.  Even if C
wants an object in order to fill in the gap in its object graph, S
may not have it (and S itself may have no need for that object), and
in such a case, bypassing S and let C go directly to X would make
sense.

> Another example could be a server where some objects have been
> corrupted or deleted. It could still be useful for clients who could
> get those objects from another source, like perhaps a different
> client, to be able to fetch or clone from the server.
>
> To configure what a server should do if there are such missing
> objects, let's teach `git upload-pack` a new
> `uploadpack.missingAction` configuration variable.
>
> This new missing-action works in a similar way as what `git rev-list`
> and `git pack-objects` already support with their
> `--missing=<missing-action>` command line options. In fact when
> `uploadpack.missingAction` is set to something different than "error",
> `git upload-pack` will just pass the corresponding
> `--missing=<missing-action>` to `git pack-objects`.
>
> So this new missing-action has the same limitations as
> `git pack-objects --missing=<missing-action>`. Especially when not
> using promisor remotes, 'allow-any' works only for blobs.
>
> Another limitation comes from `git upload-pack` itself which requires
> setting `GIT_NO_LAZY_FETCH` to 0 since 7b70e9efb1 (upload-pack:
> disable lazy-fetching by default, 2024-04-16) for lazy fetching from
> a promisor remote to work on the server side.

I am puzzled by this new option.

It feels utterly irresponsible to give an option to set up a server
that essentially declares: I'll serve the objects you ask me for on a
best-effort basis; the pack stream I give you may be missing some of
the objects you asked for, and when it is, I am not telling you which
objects I omitted.

How do you ensure that a response with an incomplete pack data would
not corrupt the repository when the sending side configures
upload-pack with this option?  How does the receiving end know which
objects it needs to ask from elsewhere?

Or is the data integrity of the receiving repository the
responsibility of the receiving user that talks with such a server?
If that is the case, I am not sure if I want to touch such a feature
even with 10ft pole.

Is there anything the sender can do but does not do to help the
receiver locate where to fetch these missing objects to fill the
"unfilled promises"?

For example, the sending side _could_ say that "Sorry, I do not have
all the objects that you asked me for---but you could try these other
repositories".
Christian Couder May 24, 2024, 4:41 p.m. UTC | #2
On Wed, May 15, 2024 at 7:09 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > From: Christian Couder <chriscool@tuxfamily.org>
> >
> > In case some objects are missing from a server, it might still be
> > useful to be able to fetch or clone from it if the client already has
> > the missing objects or can get them in some way.
>
> Be more assertive.  We do not want to add a new feature randomly
> only because it _might_ be useful to somebody in a strange and
> narrow use case that _might_ exist.

Ok, I have changed "it might still be useful" to "it is sometimes
useful" in the v3 I just sent.

> > For example, in case both the server and the client are using a
> > separate promisor remote that contains some objects, it can be better
> > if the server doesn't try to send such objects back to the client, but
> > instead lets the client get those objects separately from the promisor
> > remote. (The client needs to have the separate promisor remote
> > configured for that to work.)
>
> Is it "it can be better", or is it "it is always better"?  Pick an
> example that you can say the latter to make your example more
> convincing.
>
> Repository S borrows from its "promisor" X, and repository C which
> initially cloned from S borrows from its "promisor" S.  Even if C
> wants an object in order to fill in the gap in its object graph, S
> may not have it (and S itself may have no need for that object), and
> in such a case, bypassing S and let C go directly to X would make
> sense.

Ok, I use your example above in v3.

> I am puzzled by this new option.
>
> It feels utterly irresponsible to give an option to set up a server
> that essentially declares: I'll serve the objects you ask me for on a
> best-effort basis; the pack stream I give you may be missing some of
> the objects you asked for, and when it is, I am not telling you which
> objects I omitted.

I don't think it's irresponsible. In any case, the client checks that
it got something usable, in the same way as it does when it performs a
partial fetch or clone. The fetch or clone fails if that's not the
case. For example, if the checkout part of a clone needs some objects
but cannot get them, the whole clone fails.

> How do you ensure that a response with an incomplete pack data would
> not corrupt the repository when the sending side configures
> upload-pack with this option?  How does the receiving end know which
> objects it needs to ask from elsewhere?

Git already has support for multiple promisor remotes. When a repo is
configured with two promisor remotes, let's call them A and B, then A
cannot guarantee that B has all the objects that are missing in A. And
it's the same for B: it cannot guarantee A has all the objects missing
in B. Also, when fetching or cloning from A, for example, no list of
missing objects is transferred.
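Such a setup is plain client-side configuration; a sketch with
hypothetical remote names and URLs:

```shell
# Two promisor remotes, A and B (names and URLs hypothetical). Neither
# can guarantee it holds the objects the other is missing; Git simply
# tries the configured promisor remotes when lazy-fetching.
git config remote.A.url https://example.com/a.git
git config remote.A.promisor true
git config remote.B.url https://example.com/b.git
git config remote.B.promisor true
```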

> Or is the data integrity of the receiving repository the
> responsibility of the receiving user that talks with such a server?

Yes, it's the same as when a repo is configured with multiple promisor remotes.

When using bundle-uri, it's also the responsibility of the receiving
user to check that the bundle it gets from a separate endpoint is
correct.

Also, when only large blobs are stored on remotes other than the main
remote, then after cloning or fetching from the main remote, the
client knows the hashes of the objects it should get from the other
remotes. So there is still data integrity.
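A hedged illustration of that check on the client side (the `client`
path is hypothetical): fsck in a partial clone treats objects promised
by a promisor remote as permissibly absent, while still verifying that
everything present is well-formed and reachable.

```shell
# Verify the object graph of a partial clone; promised-but-absent
# objects do not count as corruption.
git -C client fsck --connectivity-only
```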

> If that is the case, I am not sure if I want to touch such a feature
> even with 10ft pole.

To give you some context, at GitLab we have a very large part of the
disk space on our servers taken by large blobs which often don't
compress well (see
https://gitlab.com/gitlab-org/gitaly/-/issues/5699#note_1794464340).
It makes a lot of sense for us at GitLab to try to move those large
blobs to some cheaper separate storage.

With repack --filter=... we can put large blobs on a separate path,
but it has some drawbacks:
  - they are still part of a packfile, which means that storing and
accessing the objects is expensive in terms of CPU and memory (large
binary files are often already compressed and might not delta well
with other objects),
  - mounting object storage on a machine might not be easy or might
not perform as well as using an object storage API.

So using a separate remote along with a remote helper for large blobs
makes sense. And when a client is cloning, it makes sense for a
regular Git server to let the client fetch the large blobs directly
from a remote that has them.

> Is there anything the sender can do but does not do to help the
> receiver locate where to fetch these missing objects to fill the
> "unfilled promises"?
>
> For example, the sending side _could_ say that "Sorry, I do not have
> all the objects that you asked me for---but you could try these other
> repositories".

In the cover letter of this v2 and the v3, I suggested the following:

---
For example in case of a client
cloning, something like the following is currently needed:

  GIT_NO_LAZY_FETCH=0 git clone \
      -c remote.my_promisor.promisor=true \
      -c remote.my_promisor.fetch="+refs/heads/*:refs/remotes/my_promisor/*" \
      -c remote.my_promisor.url=<MY_PROMISOR_URL> \
      --filter="blob:limit=5k" server

But it would be nice if there was a capability for the client to say
that it would like the server to give it information about the
promisor that it could use, so that the user doesn't have to pass all
the "remote.my_promisor.XXX" config options on the command line. (It
would then be a bit similar to the bundle-uri feature where all the
bundle related information comes from the server.)
---

So yes we are thinking about adding such a way to "help the receiver
locate where to fetch these missing objects" soon. We needed to start
somewhere and we decided to start with this series, because this patch
series is quite small and let us already experiment with offloading
blobs to object storage (see
https://gitlab.com/gitlab-org/gitaly/-/issues/5987).

Also, the client will very likely store the information about where it
can get the missing objects as a promisor remote configuration in its
config. So after the clone, the resulting repo will very likely be
very similar to what it would be with the clone command above.
Junio C Hamano May 24, 2024, 9:51 p.m. UTC | #3
Christian Couder <christian.couder@gmail.com> writes:

>> Repository S borrows from its "promisor" X, and repository C which
>> initially cloned from S borrows from its "promisor" S.  Even if C
>> wants an object in order to fill in the gap in its object graph, S
>> may not have it (and S itself may have no need for that object), and
>> in such a case, bypassing S and let C go directly to X would make
>> sense.
> ...
>>
>> It feels utterly irresponsible to give an option to set up a server
>> that essentially declares: I'll serve the objects you ask me for on a
>> best-effort basis; the pack stream I give you may be missing some of
>> the objects you asked for, and when it is, I am not telling you which
>> objects I omitted.
>
> I don't think it's irresponsible. In any case, the client checks that
> it got something usable, in the same way as it does when it performs a
> partial fetch or clone. The fetch or clone fails if that's not the
> case. For example, if the checkout part of a clone needs some objects
> but cannot get them, the whole clone fails.

But then what can the repository C do after seeing such a failure?

With the design, S does not even consult C to see if C knows about
X.  Without knowing that, it cannot safely decide that it does not
have to send objects that can be obtained from X to C.  Instead, S
simply says "if C requests an object that I do not have, just ignore
it and let C grab it from somewhere else".  How would it not be an
irresponsible design?
Christian Couder May 28, 2024, 10:10 a.m. UTC | #4
On Fri, May 24, 2024 at 11:51 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> >> Repository S borrows from its "promisor" X, and repository C which
> >> initially cloned from S borrows from its "promisor" S.  Even if C
> >> wants an object in order to fill in the gap in its object graph, S
> >> may not have it (and S itself may have no need for that object), and
> >> in such a case, bypassing S and let C go directly to X would make
> >> sense.
> > ...
> >>
> >> It feels utterly irresponsible to give an option to set up a server
> >> that essentially declares: I'll serve the objects you ask me for on a
> >> best-effort basis; the pack stream I give you may be missing some of
> >> the objects you asked for, and when it is, I am not telling you which
> >> objects I omitted.
> >
> > I don't think it's irresponsible. In any case, the client checks that
> > it got something usable, in the same way as it does when it performs a
> > partial fetch or clone. The fetch or clone fails if that's not the
> > case. For example, if the checkout part of a clone needs some objects
> > but cannot get them, the whole clone fails.
>
> But then what can the repository C do after seeing such a failure?

It's basically the same as when a regular clone, a partial clone, or a
clone using bundle-uri fails, or when using a regular bundle fails.
If it failed because the remote was not properly configured, then that
config can be fixed. If it fails because the remote doesn't have some
objects, then maybe the missing objects can be transferred to the
remote. And so on.

The feature doesn't create any new kind of failure. In particular,
when you use a partial clone, even a very simple one with a single
remote, there is always the risk of not being able to get some missing
objects as there is the risk of the remote being unreachable for some
reason (like if you take a plane and don't have an internet
connection, or if there is an outage on the server side). There are
some added risks because the feature requires added configuration and
it can be wrong like any configuration, and because there are 2
remotes instead of just one. But these are not new kinds of risks.
These risks already exist if one uses multiple promisor remotes.

> With the design, S does not even consult C to see if C knows about
> X.

If S is managed by a company like GitLab or GitHub, then S will
certainly advertise, for example by showing a command that can easily
be copy-pasted from the web page of the project onto the user's
command line, some way for C to use X.

In the cover letter I give the example of the following command that
can be used (and advertised by S):

  GIT_NO_LAZY_FETCH=0 git clone \
      -c remote.my_promisor.promisor=true \
      -c remote.my_promisor.fetch="+refs/heads/*:refs/remotes/my_promisor/*" \
      -c remote.my_promisor.url=<MY_PROMISOR_URL> \
      --filter="blob:limit=5k" server

I also agree in the cover letter that this is not the most user
friendly clone command and I suggest that I could work on improving on
that by saying:

"it would be nice if there was a capability for the client to say
that it would like the server to give it information about the
promisor that it could use, so that the user doesn't have to pass all
the "remote.my_promisor.XXX" config options on the command line."

and by saying that this could be added later.

If you think that such a capability should definitely be part of this
work, for example because it wouldn't be sane to require users to use
such a long and complex command and it could avoid difficult to debug
failures, then I would be willing to work on this and add it to this
series.

> Without knowing that, it cannot safely decide that it does not
> have to send objects that can be obtained from X to C.

In the above command C is asking for a partial clone, as it uses a
--filter option. This means that C knows very well that it might not
get from S all the objects needed for a complete object graph. So why
can't S safely decide not to send some objects to C? Why would it be
Ok if C wanted a partial clone but didn't want to get some objects
from X at the same time, but would not be Ok if C wants the same
partial clone but also with the possibility to get some of the objects
from X right away? To me it seems less risky to ask for some objects
from X right away.

>  Instead, S
> simply says "if C requests an object that I do not have, just ignore
> it and let C grab it from somewhere else".  How would it not be an
> irresponsible design?

Again when using a regular partial clone omitting the same set of
objects, C also requests some objects that S doesn't have. And this is
not considered an issue or something irresponsible. It already works
like this. And then C still has the possibility to configure X as a
promisor remote and get missing objects from there. So why is it Ok
when it's done in several steps but not in one?
Junio C Hamano May 28, 2024, 3:54 p.m. UTC | #5
Christian Couder <christian.couder@gmail.com> writes:

> It's basically the same as when a regular clone, a partial clone, or a
> clone using bundle-uri fails, or when using a regular bundle fails.
> If it failed because the remote was not properly configured, then that
> config can be fixed. If it fails because the remote doesn't have some
> objects, then maybe the missing objects can be transferred to the
> remote. And so on.

> The feature doesn't create any new kind of failure.

> "it would be nice if there was a capability for the client to say
> that it would like the server to give it information about the
> promisor that it could use, so that the user doesn't have to pass all
> the "remote.my_promisor.XXX" config options on the command line."
>
> and by saying that this could be added later.

Can such an update happen transparently, or does it need changes to
end-user experience?  It is of dubious value to leave the initial
series be incomplete if the latter is the case.

>> Without knowing that, it cannot safely decide that it does not
>> have to send objects that can be obtained from X to C.
>
> In the above command C is asking for a partial clone, as it uses a
> --filter option. This means that C knows very well that it might not
> get from S all the objects needed for a complete object graph.

Hmph.  If C asks a partial clone and S is willing to be the promisor
for C, S is essentially saying that it will serve C any objects on
demand that are reachable from any object it served C in the past,
forever, no?  It might not get from S initially all the objects, but
if it wants later, S promises to let C have them.


> Again when using a regular partial clone omitting the same set of
> objects, C also requests some objects that S doesn't have.  And
> this is not considered an issue or something irresponsible. It
> already works like this.

"S doesn't have" is because S has pruned objects that it shouldn't
have in order to keep the promise it made earlier to C, right?  If
that is the case, I would very much say S is being irresponsible in
that case.

> And then C still has the possibility to configure X as a
> promisor remote and get missing objects from there. So why is it Ok
> when it's done in several steps but not in one?

You are right that S can decide to unilaterally break the promise it
made C, so this update is not making it any worse than giving users
a broken implementation of promisor remotes.  I wouldn't call it OK,
though.

If somebody identifies that even without this series, S can lead to
repository corruption at C by pruning objects it does need to keep
its promise to C, the next action I expect from healthy project is
to try coming up with a mechanism to make it less likely that such a
pruning happens by accident (e.g., by noticing allowAnySHA1InWant as
a sign that the repository has promised others to serve anything
that used to be reachable from anything it historically served,
disabling repack "-d" and instead send the currently unreachable
objects to an archived pack, and something equally silly like that).
It certainly is not to add a new mechanism to make it even easier to
configure S to break promises it made to C.

So, I dunno.
Christian Couder May 31, 2024, 8:43 p.m. UTC | #6
On Tue, May 28, 2024 at 5:54 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:

> > "it would be nice if there was a capability for the client to say
> > that it would like the server to give it information about the
> > promisor that it could use, so that the user doesn't have to pass all
> > the "remote.my_promisor.XXX" config options on the command line."
> >
> > and by saying that this could be added later.
>
> Can such an update happen transparently, or does it need changes to
> end-user experience?  It is of dubious value to leave the initial
> series be incomplete if the latter is the case.

I think adding the capability could happen transparently. I think the
capability should be a generic one about the server suggesting some
config options and perhaps some command line options, and the client
agreeing (or not) to setting the config options and using the command
line options that the server suggests. Of course the important thing
when implementing this capability is to properly limit the config and
command line options (and perhaps their values) that the client
accepts to set and use. But I think it could be a generic capability
that could be extended to other options in the future. That's why I
think it could be a separate patch series.

> >> Without knowing that, it cannot safely decide that it does not
> >> have to send objects that can be obtained from X to C.
> >
> > In the above command C is asking for a partial clone, as it uses a
> > --filter option. This means that C knows very well that it might not
> > get from S all the objects needed for a complete object graph.
>
> Hmph.  If C asks a partial clone and S is willing to be the promisor
> for C, S is essentially saying that it will serve C any objects on
> demand that are reachable from any object it served C in the past,
> forever, no?  It might not get from S initially all the objects, but
> if it wants later, S promises to let C have them.

This promise is not broken, as S can still get the missing objects
from X and then pass them to C. There is even the following test case
in the patch that shows that it works when uploadpack.missingAction is
unset (and thus defaults to "error", which is the same as how things
currently work):

+test_expect_success "fetch without uploadpack.missingAction=allow-promisor" '
+       git -C server config --unset uploadpack.missingAction &&
+
+       # Clone from server to create a client
+       GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+               -c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+               -c remote.server2.url="file://$(pwd)/server2" \
+               --no-local --filter="blob:limit=5k" server client &&
+       test_when_finished "rm -rf client" &&
+
+       # Check that the largest object is not missing on the server anymore
+       check_missing_objects server 0 ""
+'

The only difference with the case when uploadpack.missingAction is set
to "allow-promisor" is that then C gets the missing objects directly
from X without S getting them first. This is why there is:

+       # Check that the largest object is still missing on the server
+       check_missing_objects server 1 "$oid"

at the end of the test case when uploadpack.missingAction is set to
"allow-promisor". This is to check that S didn't get the missing
object (which means that the missing object went directly from X to
C).

So S keeps its promise to let C have any object through S if C wants.
It's just that having large objects through S (instead of directly
from X) is not a good idea because it copies the large objects to S
first which uses up space (and memory and CPU when creating packfiles)
on S while the objects are already on X where disk space is probably
cheaper (and where less CPU and memory will be used to send the
objects).

> > Again when using a regular partial clone omitting the same set of
> > objects, C also requests some objects that S doesn't have.  And
> > this is not considered an issue or something irresponsible. It
> > already works like this.
>
> "S doesn't have" is because S has pruned objects that it shouldn't
> have in order to keep the promise it made earlier to C, right?  If
> that is the case, I would very much say S is being irresponsible in
> that case.

Yes, S has pruned some objects, but these objects have been moved to X
first and S has access to X, so S can still get the objects. So I
don't understand why you say S shouldn't have pruned them and S is
irresponsible. S is still capable of fulfilling the promise it made.

And even if S was not capable of directly fulfilling the promise it
made, it would still not be clear to me that S is irresponsible, as S
and X can be seen as a single entity from the client point of view,
and S and X together would still be capable of fulfilling the promise
that was made to C.

When bundle-uri is used, for example, the server is not fulfilling the
promise alone. It needs the bundle server to work properly for the
client to get everything. When using Git LFS, the server is not
fulfilling the promise alone either. For things to work properly both
the Git server and the LFS server have to work together.

So if it's Ok for other features to require an additional server to
fulfill the promise, why is it not Ok in the case of S + X?

> > And then C still has the possibility to configure X as a
> > promisor remote and get missing objects from there. So why is it Ok
> > when it's done in several steps but not in one?
>
> You are right that S can decide to unilaterally break the promise it
> made C, so this update is not making it any worse than giving users
> a broken implementation of promisor remotes.  I wouldn't call it OK,
> though.

I don't understand why you compare this to a "broken" implementation
of promisor remotes. What could then be a non-broken one that would
store large blobs on a separate server in your opinion? I am really
interested in answers to this question. It's not a rhetorical one.

> If somebody identifies that even without this series, S can lead to
> repository corruption at C by pruning objects it does need to keep
> its promise to C,

Objects are not pruned if you consider S + X instead of just S. It's
the same as when people are using Git LFS or bundle-uri. Nothing is
really pruned on the server side, no promise is broken. Some objects
are just moved to a separate server that is sometimes called to
provide some objects on top of those the Git server still sends.

> the next action I expect from healthy project is
> to try coming up with a mechanism to make it less likely that such a
> pruning happens by accident

When people managing server S decide to move some objects to X and make S
access X, it's not accidental pruning. But maybe you are talking
about the case when uploadpack.missingAction is set to "allow-any"
instead of "allow-promisor"?

> (e.g., by noticing allowAnySHA1InWant as
> a sign that the repository has promised others to serve anything
> that used to be reachable from anything it historically served,
> disabling repack "-d" and instead send the currently unreachable
> objects to an archived pack, and something equally silly like that).
> It certainly is not to add a new mechanism to make it even easier to
> configure S to break promises it made to C.

This new mechanism has some important benefits. Yes, there are risks,
but it's our opinion at GitLab, with some data from GitLab servers
(see the link I previously sent to a GitLab issue) to back it up, that
the rewards are well worth the added risks.

I agree that ideally a Git server should be able to handle every kind
of object very well, but that's unfortunately not the case. Git LFS
has some drawbacks, bundle-uri also has some drawbacks, but there are
reasons why they exist. Multiple promisor remotes are a solution that
already exists in Git too. Yes, it also has some drawbacks, but they
are in many ways similar to those of bundle-uri and Git LFS. Yes, this
patch is making multiple promisor remotes easier to use, but it
doesn't fundamentally change the risks associated with using multiple
promisor remotes.
Junio C Hamano June 1, 2024, 9:43 a.m. UTC | #7
Christian Couder <christian.couder@gmail.com> writes:

>> Hmph.  If C asks a partial clone and S is willing to be the promisor
>> for C, S is essentially saying that it will serve C any objects on
>> demand that are reachable from any object it served C in the past,
>> forever, no?  It might not get from S initially all the objects, but
>> if it wants later, S promises to let C have them.
>
> This promise is not broken, as S can still get the missing objects
> from X and then pass them to C. There is even the following test case
> in the patch that shows that it works when uploadpack.missingAction is
> unset (and thus defaults to "error", which is the same as how things
> currently work):

And the whole point of that configuration is to make it easier for S
to break that promise, no?  uploadPack.missingAction is set at S and
is not under control of C, right?

> So S keeps its promise to let C have any object through S if C wants.
> It's just that having large objects through S (instead of directly
> from X) is not a good idea ...
> ...
> So if it's Ok for other features to require an additional server to
> fulfill the promise, why is it not Ok in the case of S + X?

I am questioning the design that does not give C any say in the
decision whether it is a good idea to ask S to relay large objects.
S unilaterally decides that it does not want to and does not serve
such large objects to C, and without even checking with C if it can
reach X to fetch directly, silently assuming that C will do so,
right?  It is quite different from the contract between C and S in
the simpler world.

> I don't understand why you compare this to a "broken" implementation
> of promisor remotes. What could then be a non-broken one that would
> store large blobs on a separate server in your opinion? I am really
> interested in answers to this question. It's not a rhetorical one.

You as S would tell C "I want you to go to X because I am not
serving objects X and Y".  Or at least make sure that C knows about
X before deciding to omit what X ought to have.  Not doing anything
and silently assuming that C will get them from elsewhere is simply
irresponsible, especially if C is not even told about X over the
protocol, no?
Christian Couder June 3, 2024, 3:01 p.m. UTC | #8
On Sat, Jun 1, 2024 at 11:43 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:

> > I don't understand why you compare this to a "broken" implementation
> > of promisor remotes. What could then be a non-broken one that would
> > store large blobs on a separate server in your opinion? I am really
> > interested in answers to this question. It's not a rhetorical one.
>
> You as S would tell C "I want you to go to X because I am not
> serving objects X and Y".  Or at least make sure that C knows about
> X before deciding to omit what X ought to have.

OK, so if there were a way for S to suggest config options, and perhaps
also command line options, to C (like I mentioned in a previous email),
and if S suggested that C configure X as a promisor remote and C
accepted doing that, then it would be fine for the feature to work,
right?

Alternatively, if C would pass a new option called for example
--known-promisor=X on top of all other options, then that could be Ok
too?
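
Independently of any new option, the "C knows about X" part can
already be expressed with existing configuration; the
`--known-promisor=X` flag above is only a proposal at this point. A
minimal sketch of the client-side setup, with a placeholder URL for X:

```shell
# Sketch: client C declares X as a promisor remote using only existing
# config keys; the URL is a placeholder, and X does not need to exist
# for the configuration itself to be set.
git init --quiet client-c
git -C client-c remote add x "file:///tmp/x.git"
git -C client-c config remote.x.promisor true
git -C client-c config remote.x.partialclonefilter blob:none
git -C client-c config --get remote.x.promisor
```

With such a configuration in place, C could lazily fetch objects
missing from S directly from X, which is the situation that
`uploadpack.missingAction` on S is meant to account for.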
Junio C Hamano June 3, 2024, 5:29 p.m. UTC | #9
Christian Couder <christian.couder@gmail.com> writes:

> Alternatively, if C would pass a new option called for example
> --known-promisor=X on top of all other options, then that could be Ok
> too?

Why do you need a new option "--known-promisor=X" in the first
place?  Isn't X already known by the repository, so that you do not
have to bother the end user to supply it from the command line?

In any case, we could have a protocol extension that lets S tell C
that S wants C to fetch missing objects from X (which means that if C
knows about X in its ".git/config" then there is no need for end-user
interaction at all), or a protocol extension that lets C tell S that
C is willing to see objects available from X omitted when S does not
have them (again, this could be done by looking at ".git/config" at
C, but there may be security implications???).  Either would work OK
and I offhand do not see much preference between the two.

What is important is to ensure that such deliberate omission by S is
safe and acceptable to C.

Thanks.


[Footnote]

By the way, I briefly wondered: would "X" that is valid for "C" be
always valid for "S" these days?

I as "S" may prefer you as "C" to go ssh://gitlab.com/x.git but you
may only be able to go https://gitlab.com/x.git for networking
reasons.  Silly situation like that may make communicating "X"
between "S" and "C" a bit harder than we would like.  Of course this
is not a high priority issue as the user with such networking need
are probably already using url.X.insteadof to rewrite the submodule
paths recommended in .gitmodules after they clone a superproject,
iow, even if it were a problem, it is a solved one.

So, in short, yes, exchanging the known promisor "X" between "S" and
"C", regardless of which side speaks first, as long as the negotiation
makes sure both sides agree, would be a good direction to go, I
would think.
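
For reference, the server-side knob this series introduces is set
like any other config key; a sketch based on the patch below (the key
only takes effect with the series applied, and setting it is
otherwise inert):

```shell
# Sketch: configure server S to pass --missing=allow-promisor down to
# git pack-objects, per the patch below. The values the new code
# accepts are "error", "allow-any" and "allow-promisor".
git init --quiet --bare server-s
git -C server-s config uploadpack.missingAction allow-promisor
git -C server-s config --get uploadpack.missingAction
```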
diff mbox series

Patch

diff --git a/Documentation/config/uploadpack.txt b/Documentation/config/uploadpack.txt
index 16264d82a7..cd70e853b3 100644
--- a/Documentation/config/uploadpack.txt
+++ b/Documentation/config/uploadpack.txt
@@ -82,3 +82,12 @@  uploadpack.allowRefInWant::
 	is intended for the benefit of load-balanced servers which may
 	not have the same view of what OIDs their refs point to due to
 	replication delay.
+
+uploadpack.missingAction::
+	If this option is set, `upload-pack` will call
+	linkgit:git-pack-objects[1] passing it the corresponding
+	`--missing=<missing-action>` command line option. See the
+	documentation of that option for the valid values and their
+	meaning. This can be useful if some objects are missing on a
+	server, but a client already has them or can still get them
+	from somewhere else.
diff --git a/missing.c b/missing.c
index ce3cf734a8..75036d2089 100644
--- a/missing.c
+++ b/missing.c
@@ -18,3 +18,19 @@  int parse_missing_action_value(const char *value)
 
 	return -1;
 }
+
+const char *missing_action_to_string(enum missing_action action)
+{
+	switch (action) {
+	case MA_ERROR:
+		return "error";
+	case MA_ALLOW_ANY:
+		return "allow-any";
+	case MA_PRINT:
+		return "print";
+	case MA_ALLOW_PROMISOR:
+		return "allow-promisor";
+	default:
+		BUG("invalid missing action %d", action);
+	}
+}
diff --git a/missing.h b/missing.h
index 1e378d6215..74753d7887 100644
--- a/missing.h
+++ b/missing.h
@@ -14,4 +14,6 @@  enum missing_action {
 */
 int parse_missing_action_value(const char *value);
 
+const char *missing_action_to_string(enum missing_action action);
+
 #endif /* MISSING_H */
diff --git a/t/t5706-upload-pack-missing.sh b/t/t5706-upload-pack-missing.sh
new file mode 100755
index 0000000000..0f39c0ddd0
--- /dev/null
+++ b/t/t5706-upload-pack-missing.sh
@@ -0,0 +1,125 @@ 
+#!/bin/sh
+
+test_description='handling of missing objects in upload-pack'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+	git init template &&
+	test_commit -C template 1 &&
+	test_commit -C template 2 &&
+	test_commit -C template 3 &&
+	test-tool genrandom foo 10240 >template/foo &&
+	git -C template add foo &&
+	git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+	git clone --bare --no-local template server &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile"
+'
+
+# Fetching with 'uploadpack.missingAction=allow-any' only works with
+# blobs, as `git pack-objects --missing=allow-any` fails if a missing
+# object is not a blob.
+test_expect_success "fetch with uploadpack.missingAction=allow-any" '
+	oid="$(git -C server rev-parse HEAD:1.t)" &&
+	oid_path="$(test_oid_to_path $oid)" &&
+	path="server/objects/$oid_path" &&
+
+	mv "$path" "$path.hidden" &&
+	test_when_finished "mv $path.hidden $path" &&
+
+	git init client &&
+	test_when_finished "rm -rf client" &&
+
+	# Client needs the missing objects to be available somehow
+	client_path="client/.git/objects/$oid_path" &&
+	mkdir -p $(dirname "$client_path") &&
+	cp "$path.hidden" "$client_path" &&
+
+	test_must_fail git -C client fetch ../server &&
+	git -C server config uploadpack.missingAction error &&
+	test_must_fail git -C client fetch ../server &&
+	git -C server config uploadpack.missingAction allow-any &&
+	git -C client fetch ../server &&
+	git -C server config --unset uploadpack.missingAction
+'
+
+check_missing_objects () {
+	git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+	sed -n "s/^\?\(.*\)/\1/p" <all.txt >missing.txt &&
+	test_line_count = "$2" missing.txt &&
+	test "$3" = "$(cat missing.txt)"
+}
+
+test_expect_success "setup for testing uploadpack.missingAction=allow-promisor" '
+	# Create another bare repo called "server2"
+	git init --bare server2 &&
+
+	# Copy the largest object from server to server2
+	obj="HEAD:foo" &&
+	oid="$(git -C server rev-parse $obj)" &&
+	oid_path="$(test_oid_to_path $oid)" &&
+	path="server/objects/$oid_path" &&
+	path2="server2/objects/$oid_path" &&
+	mkdir -p $(dirname "$path2") &&
+	cp "$path" "$path2" &&
+
+	# Repack everything first
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Repack without the largest object and create a promisor pack on server
+	git -C server -c repack.writebitmaps=false repack -a -d \
+	    --filter=blob:limit=5k --filter-to="$(pwd)" &&
+	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+	> "$promisor_file" &&
+
+	# Check that only one object is missing on the server
+	check_missing_objects server 1 "$oid" &&
+
+	# Configure server2 as promisor remote for server
+	git -C server remote add server2 "file://$(pwd)/server2" &&
+	git -C server config remote.server2.promisor true &&
+
+	git -C server2 config uploadpack.allowFilter true &&
+	git -C server2 config uploadpack.allowAnySHA1InWant true &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "fetch with uploadpack.missingAction=allow-promisor" '
+	git -C server config uploadpack.missingAction allow-promisor &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
+'
+
+test_expect_success "fetch without uploadpack.missingAction=allow-promisor" '
+	git -C server config --unset uploadpack.missingAction &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server anymore
+	check_missing_objects server 0 ""
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 902144b9d3..f114f92197 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -29,6 +29,8 @@ 
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
+#include "missing.h"
+#include "object-file.h"
 
 /* Remember to update object flag allocation in object.h */
 #define THEY_HAVE	(1u << 11)
@@ -96,6 +98,8 @@  struct upload_pack_data {
 
 	const char *pack_objects_hook;
 
+	enum missing_action missing_action;
+
 	unsigned stateless_rpc : 1;				/* v0 only */
 	unsigned no_done : 1;					/* v0 only */
 	unsigned daemon_mode : 1;				/* v0 only */
@@ -315,6 +319,9 @@  static void create_pack_file(struct upload_pack_data *pack_data,
 		strvec_push(&pack_objects.args, "--delta-base-offset");
 	if (pack_data->use_include_tag)
 		strvec_push(&pack_objects.args, "--include-tag");
+	if (pack_data->missing_action)
+		strvec_pushf(&pack_objects.args, "--missing=%s",
+			     missing_action_to_string(pack_data->missing_action));
 	if (pack_data->filter_options.choice) {
 		const char *spec =
 			expand_list_objects_filter_spec(&pack_data->filter_options);
@@ -1371,6 +1378,18 @@  static int upload_pack_config(const char *var, const char *value,
 		precomposed_unicode = git_config_bool(var, value);
 	} else if (!strcmp("transfer.advertisesid", var)) {
 		data->advertise_sid = git_config_bool(var, value);
+	} else if (!strcmp("uploadpack.missingaction", var)) {
+		int res = parse_missing_action_value(value);
+		if (res < 0 || (res != MA_ERROR &&
+				res != MA_ALLOW_ANY &&
+				res != MA_ALLOW_PROMISOR))
+			die(_("invalid value for '%s': '%s'"), var, value);
+		/* Allow fetching only from promisor remotes */
+		if (res == MA_ALLOW_PROMISOR)
+			fetch_if_missing = 1;
+		if (res == MA_ALLOW_ANY)
+			fetch_if_missing = 0;
+		data->missing_action = res;
 	}
 
 	if (parse_object_filter_config(var, value, ctx->kvi, data) < 0)