Message ID: 20240515132543.851987-4-christian.couder@gmail.com
State: Superseded
Series: upload-pack: support a missing-action
Christian Couder <christian.couder@gmail.com> writes:

> From: Christian Couder <chriscool@tuxfamily.org>
>
> In case some objects are missing from a server, it might still be
> useful to be able to fetch or clone from it if the client already has
> the missing objects or can get them in some way.

Be more assertive. We do not want to add a new feature randomly
only because it _might_ be useful to somebody in a strange and
narrow use case that _might_ exist.

> For example, in case both the server and the client are using a
> separate promisor remote that contains some objects, it can be better
> if the server doesn't try to send such objects back to the client, but
> instead let the client get those objects separately from the promisor
> remote. (The client needs to have the separate promisor remote
> configured, for that to work.)

Is it "it can be better", or is it "it is always better"? Pick an
example that you can say the latter to make your example more
convincing.

Repository S borrows from its "promisor" X, and repository C which
initially cloned from S borrows from its "promisor" S. Even if C
wants an object in order to fill in the gap in its object graph, S
may not have it (and S itself may have no need for that object), and
in such a case, bypassing S and let C go directly to X would make
sense.

> Another example could be a server where some objects have been
> corrupted or deleted. It could still be useful for clients who could
> get those objects from another source, like perhaps a different
> client, to be able to fetch or clone from the server.
>
> To configure what a server should do if there are such missing
> objects, let's teach `git upload-pack` a new
> `uploadpack.missingAction` configuration variable.
>
> This new missing-action works in a similar way as what `git rev-list`
> and `git pack-objects` already support with their
> `--missing=<missing-action>` command line options.
> In fact when `uploadpack.missingAction` is set to something different
> than "error", `git upload-pack` will just pass the corresponding
> `--missing=<missing-action>` to `git pack-objects`.
>
> So this new missing-action has the same limitations as
> `git pack-objects --missing=<missing-action>`. Especially when not
> using promisor remotes, 'allow-any' works only for blobs.
>
> Another limitation comes from `git upload-pack` itself which requires
> setting `GIT_NO_LAZY_FETCH` to 0 since 7b70e9efb1 (upload-pack:
> disable lazy-fetching by default, 2024-04-16) for lazy fetching from
> a promisor remote to work on the server side.

I am puzzled by this new option.

It feels utterly irresponsible to give an option to set up a server
that essentially declares: I'll serve objects you ask me on a
best-efforts basis; the pack stream I'll give you may not have all
the objects you asked for, and when I omit some, I am not telling
you which objects I omitted.

How do you ensure that a response with an incomplete pack data would
not corrupt the repository when the sending side configures
upload-pack with this option? How does the receiving end know which
objects it needs to ask from elsewhere?

Or is the data integrity of the receiving repository the
responsibility of the receiving user that talks with such a server?
If that is the case, I am not sure if I want to touch such a feature
even with 10ft pole.

Is there anything the sender can do but does not do to help the
receiver locate where to fetch these missing objects to fill the
"unfilled promises"?

For example, the sending side _could_ say that "Sorry, I do not have
all objects that you asked me to---but you could try these other
repositories".
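[Editorial note: the mapping discussed above bottoms out in `git pack-objects --missing=<action>`, which the series has `upload-pack` invoke. A minimal sketch of what the server side would effectively run; the repository name is made up, and this assumes the `uploadpack.missingAction` name from the patch:]

```shell
# Create a throwaway repository and pack HEAD the way upload-pack
# would, forwarding the configured missing-action to pack-objects.
git init --quiet server
git -C server -c user.email=a@b -c user.name=a commit --allow-empty -qm one
git -C server config uploadpack.missingAction allow-any
action=$(git -C server config uploadpack.missingAction)
# pack-objects traverses the objects reachable from HEAD and writes a
# pack stream; missing objects are handled per --missing=<action>.
echo HEAD | git -C server pack-objects --revs --stdout \
	--missing="$action" >pack.out
head -c 4 pack.out   # a valid pack stream starts with "PACK"
```

Nothing is actually missing in this toy repository, so `allow-any` behaves like `error` here; the point is only the config-to-flag plumbing.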
On Wed, May 15, 2024 at 7:09 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > From: Christian Couder <chriscool@tuxfamily.org>
> >
> > In case some objects are missing from a server, it might still be
> > useful to be able to fetch or clone from it if the client already has
> > the missing objects or can get them in some way.
>
> Be more assertive. We do not want to add a new feature randomly
> only because it _might_ be useful to somebody in a strange and
> narrow use case that _might_ exist.

Ok, I have changed "it might still be useful" to "it is sometimes
useful" in the v3 I just sent.

> > For example, in case both the server and the client are using a
> > separate promisor remote that contains some objects, it can be better
> > if the server doesn't try to send such objects back to the client, but
> > instead let the client get those objects separately from the promisor
> > remote. (The client needs to have the separate promisor remote
> > configured, for that to work.)
>
> Is it "it can be better", or is it "it is always better"? Pick an
> example that you can say the latter to make your example more
> convincing.
>
> Repository S borrows from its "promisor" X, and repository C which
> initially cloned from S borrows from its "promisor" S. Even if C
> wants an object in order to fill in the gap in its object graph, S
> may not have it (and S itself may have no need for that object), and
> in such a case, bypassing S and let C go directly to X would make
> sense.

Ok, I use your example above in v3.

> I am puzzled by this new option.
>
> It feels utterly irresponsible to give an option to set up a server
> that essentially declares: I'll serve objects you ask me on a
> best-efforts basis; the pack stream I'll give you may not have all
> the objects you asked for, and when I omit some, I am not telling
> you which objects I omitted.

I don't think it's irresponsible.
The client anyways checks that it got something usable in the same
way as it does when it performs a partial fetch or clone. The fetch
or clone fails if that's not the case. For example if the checkout
part of a clone needs some objects but cannot get them, the whole
clone fails.

> How do you ensure that a response with an incomplete pack data would
> not corrupt the repository when the sending side configures
> upload-pack with this option? How does the receiving end know which
> objects it needs to ask from elsewhere?

Git already has support for multiple promisor remotes. When a repo is
configured with 2 promisor remotes, let's call them A and B, then A
cannot guarantee that B has all the objects that are missing in A.
And it's the same for B, it cannot guarantee A has all the objects
missing in B. Also when fetching or cloning from A for example, then
no list of missing objects is transferred.

> Or is the data integrity of the receiving repository the
> responsibility of the receiving user that talks with such a server?

Yes, it's the same as when a repo is configured with multiple
promisor remotes. When using bundle-uri, it's also the responsibility
of the receiving user to check that the bundle it gets from a
separate endpoint is correct.

Also when only large blobs are on remotes except for the main remote,
then, after cloning or fetching from the main remote, the client
knows the hashes of the objects it should get from the other remotes.
So there is still data integrity.

> If that is the case, I am not sure if I want to touch such a feature
> even with 10ft pole.

To give you some context, at GitLab we have a very large part of the
disk space on our servers taken by large blobs which often don't
compress well (see
https://gitlab.com/gitlab-org/gitaly/-/issues/5699#note_1794464340).
It makes a lot of sense for us at GitLab to try to move those large
blobs to some cheaper separate storage.

With repack --filter=...
we can put large blobs on a separate path, but it has some drawbacks:

- they are still part of a packfile, which means that storing and
  accessing the objects is expensive in terms of CPU and memory
  (large binary files are often already compressed and might not
  delta well with other objects),

- mounting object storage on a machine might not be easy or might not
  perform as well as using an object storage API.

So using a separate remote along with a remote helper for large blobs
makes sense. And when a client is cloning, it makes sense for a
regular Git server, to let the client fetch the large blobs directly
from a remote that has them.

> Is there anything the sender can do but does not do to help the
> receiver locate where to fetch these missing objects to fill the
> "unfilled promises"?
>
> For example, the sending side _could_ say that "Sorry, I do not have
> all objects that you asked me to---but you could try these other
> repositories".

In the cover letter of this v2 and the v3, I suggested the following:

---

For example in case of a client cloning, something like the following
is currently needed:

GIT_NO_LAZY_FETCH=0 git clone -c remote.my_promisor.promisor=true \
    -c remote.my_promisor.fetch="+refs/heads/*:refs/remotes/my_promisor/*" \
    -c remote.my_promisor.url=<MY_PROMISOR_URL> \
    --filter="blob:limit=5k" server

But it would be nice if there was a capability for the client to say
that it would like the server to give it information about the
promisor that it could use, so that the user doesn't have to pass all
the "remote.my_promisor.XXX" config options on the command line. (It
would then be a bit similar to the bundle-uri feature where all the
bundle related information comes from the server.)

---

So yes we are thinking about adding such a way to "help the receiver
locate where to fetch these missing objects" soon.
We needed to start somewhere and we decided to start with this
series, because this patch series is quite small and lets us already
experiment with offloading blobs to object storage (see
https://gitlab.com/gitlab-org/gitaly/-/issues/5987).

Also the client will anyway very likely store the information about
where it can get the missing objects as a promisor remote
configuration in its config. So after the clone, the resulting repo
will very likely be very similar to what it would be with the clone
command above.
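[Editorial note: the promisor-remote configuration the client ends up with can also be written by hand. A sketch with made-up remote names and paths, mirroring the two-promisor A/B setup described earlier in the thread:]

```shell
# Configure two promisor remotes A and B by hand; with such a setup
# Git may lazily fetch objects missing locally from either of them.
git init --quiet repo
git -C repo config remote.A.url /srv/promisor-a.git
git -C repo config remote.A.promisor true
git -C repo config remote.A.partialclonefilter blob:none
git -C repo config remote.B.url /srv/promisor-b.git
git -C repo config remote.B.promisor true
# Show the resulting promisor flag for remote A.
git -C repo config --get remote.A.promisor
```

This is exactly the kind of `remote.<name>.promisor` configuration that the long `git clone -c ...` invocation above leaves behind in the client's config.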
Christian Couder <christian.couder@gmail.com> writes:

>> Repository S borrows from its "promisor" X, and repository C which
>> initially cloned from S borrows from its "promisor" S. Even if C
>> wants an object in order to fill in the gap in its object graph, S
>> may not have it (and S itself may have no need for that object), and
>> in such a case, bypassing S and let C go directly to X would make
>> sense.
> ...
>>
>> It feels utterly irresponsible to give an option to set up a server
>> that essentially declares: I'll serve objects you ask me on a
>> best-efforts basis; the pack stream I'll give you may not have all
>> the objects you asked for, and when I omit some, I am not telling
>> you which objects I omitted.
>
> I don't think it's irresponsible. The client anyways checks that it
> got something usable in the same way as it does when it performs a
> partial fetch or clone. The fetch or clone fails if that's not the
> case. For example if the checkout part of a clone needs some objects
> but cannot get them, the whole clone fails.

But then what can the repository C do after seeing such a failure?

With the design, S does not even consult C to see if C knows about X.
Without knowing that, it cannot safely decide that it does not have
to send objects that can be obtained from X to C. Instead, S simply
says "if C requests an object that I do not have, just ignore it and
let C grab it from somewhere else". How would it not be an
irresponsible design?
On Fri, May 24, 2024 at 11:51 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> >> Repository S borrows from its "promisor" X, and repository C which
> >> initially cloned from S borrows from its "promisor" S. Even if C
> >> wants an object in order to fill in the gap in its object graph, S
> >> may not have it (and S itself may have no need for that object), and
> >> in such a case, bypassing S and let C go directly to X would make
> >> sense.
> > ...
> >>
> >> It feels utterly irresponsible to give an option to set up a server
> >> that essentially declares: I'll serve objects you ask me on a
> >> best-efforts basis; the pack stream I'll give you may not have all
> >> the objects you asked for, and when I omit some, I am not telling
> >> you which objects I omitted.
> >
> > I don't think it's irresponsible. The client anyways checks that it
> > got something usable in the same way as it does when it performs a
> > partial fetch or clone. The fetch or clone fails if that's not the
> > case. For example if the checkout part of a clone needs some objects
> > but cannot get them, the whole clone fails.
>
> But then what can the repository C do after seeing such a failure?

It's basically the same as when a regular clone or a partial clone or
a clone using bundle-uri fails or when using a regular bundle fails.
If it failed because the remote was not properly configured, then
that config can be fixed. If it fails because the remote doesn't have
some objects, then maybe the missing objects can be transferred to
the remote. And so on.

The feature doesn't create any new kind of failure.
In particular, when you use a partial clone, even a very simple one
with a single remote, there is always the risk of not being able to
get some missing objects as there is the risk of the remote being
unreachable for some reason (like if you take a plane and don't have
an internet connection, or if there is an outage on the server side).

There are some added risks because the feature requires added
configuration and it can be wrong like any configuration, and because
there are 2 remotes instead of just one. But these are not new kinds
of risks. These risks already exist if one uses multiple promisor
remotes.

> With the design, S does not even consult C to see if C knows about
> X.

If S is managed by a company like GitLab or GitHub, then S will
certainly advertise, for example by showing a command that can easily
be copy-pasted from the web page of the project onto the user's
command line, some way for C to use X. In the cover letter I give the
example of the following command that can be used (and advertised by
S):

GIT_NO_LAZY_FETCH=0 git clone -c remote.my_promisor.promisor=true \
    -c remote.my_promisor.fetch="+refs/heads/*:refs/remotes/my_promisor/*" \
    -c remote.my_promisor.url=<MY_PROMISOR_URL> \
    --filter="blob:limit=5k" server

I also agree in the cover letter that this is not the most user
friendly clone command and I suggest that I could work on improving
on that by saying:

"it would be nice if there was a capability for the client to say
that it would like the server to give it information about the
promisor that it could use, so that the user doesn't have to pass all
the "remote.my_promisor.XXX" config options on the command line."

and by saying that this could be added later.

If you think that such a capability should definitely be part of this
work, for example because it wouldn't be sane to require users to use
such a long and complex command and it could avoid difficult to debug
failures, then I would be willing to work on this and add it to this
series.
> Without knowing that, it cannot safely decide that it does not
> have to send objects that can be obtained from X to C.

In the above command C is asking for a partial clone, as it uses a
--filter option. This means that C knows very well that it might not
get from S all the objects needed for a complete object graph. So why
can't S safely decide not to send some objects to C?

Why would it be Ok if C wanted a partial clone but didn't want to get
some objects from X at the same time, but would not be Ok if C wants
the same partial clone but also with the possibility to get some of
the objects from X right away? To me it seems less risky to ask for
some objects from X right away.

> Instead, S simply says "if C requests an object that I do not have,
> just ignore it and let C grab it from somewhere else". How would it
> not be an irresponsible design?

Again when using a regular partial clone omitting the same set of
objects, C also requests some objects that S doesn't have. And this
is not considered an issue or something irresponsible. It already
works like this. And then C still has the possibility to configure X
as a promisor remote and get missing objects from there. So why is it
Ok when it's done in several steps but not in one?
Christian Couder <christian.couder@gmail.com> writes:

> It's basically the same as when a regular clone or a partial clone or
> a clone using bundle-uri fails or when using a regular bundle fails.
> If it failed because the remote was not properly configured, then that
> config can be fixed. If it fails because the remote doesn't have some
> objects, then maybe the missing objects can be transferred to the
> remote. And so on.
> The feature doesn't create any new kind of failure.

> "it would be nice if there was a capability for the client to say
> that it would like the server to give it information about the
> promisor that it could use, so that the user doesn't have to pass all
> the "remote.my_promisor.XXX" config options on the command line."
>
> and by saying that this could be added later.

Can such an update happen transparently, or does it need changes to
end-user experience? It is of dubious value to leave the initial
series incomplete if the latter is the case.

>> Without knowing that, it cannot safely decide that it does not
>> have to send objects that can be obtained from X to C.
>
> In the above command C is asking for a partial clone, as it uses a
> --filter option. This means that C knows very well that it might not
> get from S all the objects needed for a complete object graph.

Hmph. If C asks a partial clone and S is willing to be the promisor
for C, S is essentially saying that it will serve C any objects on
demand that are reachable from any object it served C in the past,
forever, no? It might not get from S initially all the objects, but
if it wants later, S promises to let C have them.

> Again when using a regular partial clone omitting the same set of
> objects, C also requests some objects that S doesn't have. And
> this is not considered an issue or something irresponsible. It
> already works like this.

"S doesn't have" is because S has pruned objects that it shouldn't
have in order to keep the promise it made earlier to C, right?
If that is the case, I would very much say S is being irresponsible
in that case.

> And then C still has the possibility to configure X as a
> promisor remote and get missing objects from there. So why is it Ok
> when it's done in several steps but not in one?

You are right that S can decide to unilaterally break the promise it
made C, so this update is not making it any worse than giving users a
broken implementation of promisor remotes. I wouldn't call it OK,
though.

If somebody identifies that even without this series, S can lead to
repository corruption at C by pruning objects it does need to keep
its promise to C, the next action I expect from a healthy project is
to try coming up with a mechanism to make it less likely that such a
pruning happens by accident (e.g., by noticing allowAnySHA1InWant as
a sign that the repository has promised others to serve anything that
used to be reachable from anything it historically served, disabling
repack "-d" and instead sending the currently unreachable objects to
an archived pack, and something equally silly like that). It
certainly is not to add a new mechanism to make it even easier to
configure S to break promises it made to C.

So, I dunno.
On Tue, May 28, 2024 at 5:54 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > "it would be nice if there was a capability for the client to say
> > that it would like the server to give it information about the
> > promisor that it could use, so that the user doesn't have to pass all
> > the "remote.my_promisor.XXX" config options on the command line."
> >
> > and by saying that this could be added later.
>
> Can such an update happen transparently, or does it need changes to
> end-user experience? It is of dubious value to leave the initial
> series incomplete if the latter is the case.

I think adding the capability could happen transparently.

I think the capability should be a generic one about the server
suggesting some config options and perhaps some command line options,
and the client agreeing (or not) to setting the config options and
using the command line options that the server suggests.

Of course the important thing when implementing this capability is to
properly limit the config and command line options (and perhaps their
values) that the client accepts to set and use. But I think it could
be a generic capability that could be extended to other options in
the future. That's why I think it could be a separate patch series.

> >> Without knowing that, it cannot safely decide that it does not
> >> have to send objects that can be obtained from X to C.
> >
> > In the above command C is asking for a partial clone, as it uses a
> > --filter option. This means that C knows very well that it might not
> > get from S all the objects needed for a complete object graph.
>
> Hmph. If C asks a partial clone and S is willing to be the promisor
> for C, S is essentially saying that it will serve C any objects on
> demand that are reachable from any object it served C in the past,
> forever, no? It might not get from S initially all the objects, but
> if it wants later, S promises to let C have them.
This promise is not broken, as S can still get the missing objects
from X and then pass them to C. There is even the following test case
in the patch that shows that it works when uploadpack.missingAction
is unset (and thus defaults to "error", which is the same as how
things currently work):

+test_expect_success "fetch without uploadpack.missingAction=allow-promisor" '
+	git -C server config --unset uploadpack.missingAction &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server anymore
+	check_missing_objects server 0 ""
+'

The only difference with the case when uploadpack.missingAction is
set to "allow-promisor" is that then C gets the missing objects
directly from X without S getting them first. This is why there is:

+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"

at the end of the test case when uploadpack.missingAction is set to
"allow-promisor". This is to check that S didn't get the missing
object (which means that the missing object went directly from X to
C).

So S keeps its promise to let C have any object through S if C wants.
It's just that having large objects through S (instead of directly
from X) is not a good idea because it copies the large objects to S
first, which uses up space (and memory and CPU when creating
packfiles) on S while the objects are already on X where disk space
is probably cheaper (and where less CPU and memory will be used to
send the objects).

> > Again when using a regular partial clone omitting the same set of
> > objects, C also requests some objects that S doesn't have. And
> > this is not considered an issue or something irresponsible.
> > It already works like this.
>
> "S doesn't have" is because S has pruned objects that it shouldn't
> have in order to keep the promise it made earlier to C, right? If
> that is the case, I would very much say S is being irresponsible in
> that case.

Yes, S has pruned some objects, but these objects have been moved to
X first and S has access to X, so S can still get the objects. So I
don't understand why you say S shouldn't have pruned them and S is
irresponsible. S is still capable of fulfilling the promise it made.

And even if S was not capable of directly fulfilling the promise it
made, it would still not be clear to me that S is irresponsible, as S
and X can be seen as a single entity from the client point of view,
and S and X together would still be capable of fulfilling the promise
that was made to C.

When bundle-uri is used, for example, the server is not fulfilling
the promise alone. It needs the bundle server to work properly for
the client to get everything.

When using Git LFS, the server is not fulfilling the promise alone
either. For things to work properly both the Git server and the LFS
server have to work together.

So if it's Ok for other features to require an additional server to
fulfill the promise, why is it not Ok in the case of S + X?

> > And then C still has the possibility to configure X as a
> > promisor remote and get missing objects from there. So why is it Ok
> > when it's done in several steps but not in one?
>
> You are right that S can decide to unilaterally break the promise it
> made C, so this update is not making it any worse than giving users
> a broken implementation of promisor remotes. I wouldn't call it OK,
> though.

I don't understand why you compare this to a "broken" implementation
of promisor remotes. What could then be a non-broken one that would
store large blobs on a separate server in your opinion? I am really
interested in answers to this question. It's not a rhetorical one.
> If somebody identifies that even without this series, S can lead to
> repository corruption at C by pruning objects it does need to keep
> its promise to C,

Objects are not pruned if you consider S + X instead of just S. It's
the same as when people are using Git LFS or bundle-uri. Nothing is
really pruned on the server side, no promise is broken. Some objects
are just moved to a separate server that is sometimes called to
provide some objects on top of those the Git server still sends.

> the next action I expect from a healthy project is
> to try coming up with a mechanism to make it less likely that such a
> pruning happens by accident

When people managing server S decide to move some objects to X and
make S access X, it's not accidental pruning. But maybe you are
talking about the case when uploadpack.missingAction is set to
"allow-any" instead of "allow-promisor"?

> (e.g., by noticing allowAnySHA1InWant as
> a sign that the repository has promised others to serve anything
> that used to be reachable from anything it historically served,
> disabling repack "-d" and instead sending the currently unreachable
> objects to an archived pack, and something equally silly like that).
> It certainly is not to add a new mechanism to make it even easier to
> configure S to break promises it made to C.

This new mechanism has some important benefits. Yes, there are risks,
but it's our opinion at GitLab, with some data from GitLab servers
(see the link I previously sent to a GitLab issue) to back it up,
that the rewards are well worth the added risks.

I agree that ideally a Git server should be able to handle every kind
of object very well, but that's unfortunately not the case. Git LFS
has some drawbacks, bundle-uri also has some drawbacks, but there are
reasons why they exist. Multiple promisor remotes are a solution that
already exists in Git too. Yes, it also has some drawbacks, but they
are in many ways similar to those of bundle-uri and Git LFS.
Yes, this patch is making multiple promisor remotes easier to use, but it doesn't fundamentally change the risks associated with using multiple promisor remotes.
Christian Couder <christian.couder@gmail.com> writes:

>> Hmph. If C asks a partial clone and S is willing to be the promisor
>> for C, S is essentially saying that it will serve C any objects on
>> demand that are reachable from any object it served C in the past,
>> forever, no? It might not get from S initially all the objects, but
>> if it wants later, S promises to let C have them.
>
> This promise is not broken, as S can still get the missing objects
> from X and then pass them to C. There is even the following test case
> in the patch that shows that it works when uploadpack.missingAction is
> unset (and thus defaults to "error", which is the same as how things
> currently work):

And the whole point of that configuration is to make it easier for S
to break that promise, no? uploadPack.missingAction is set at S and
is not under control of C, right?

> So S keeps its promise to let C have any object through S if C wants.
> It's just that having large objects through S (instead of directly
> from X) is not a good idea ...
> ...
> So if it's Ok for other features to require an additional server to
> fulfill the promise, why is it not Ok in the case of S + X?

I am questioning the design that does not give C any say in the
decision if it is a good idea or not to ask S to relay large objects.
S unilaterally decides that it does not want to and does not serve
such large objects to C, and without even checking with C if it can
reach X to fetch directly, silently assuming that C will do so,
right? It is quite different from the contract between C and S in the
simpler world.

> I don't understand why you compare this to a "broken" implementation
> of promisor remotes. What could then be a non-broken one that would
> store large blobs on a separate server in your opinion? I am really
> interested in answers to this question. It's not a rhetorical one.

You as S would tell C "I want you to go to X because I am not serving
objects X and Y".
Or at least make sure that C knows about X before deciding to omit
what X ought to have.

Not doing anything and silently assuming that C will get them from
elsewhere is simply irresponsible, especially if C is not even told
about X over the protocol, no?
On Sat, Jun 1, 2024 at 11:43 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > I don't understand why you compare this to a "broken" implementation
> > of promisor remotes. What could then be a non-broken one that would
> > store large blobs on a separate server in your opinion? I am really
> > interested in answers to this question. It's not a rhetorical one.
>
> You as S would tell C "I want you to go to X because I am not
> serving objects X and Y". Or at least make sure that C knows about
> X before deciding to omit what X ought to have.

Ok, so if there was a way for S to suggest config options and perhaps
also command line options to C (like I mentioned in a previous
email), and if S would suggest C to configure X as a promisor remote
and C would accept to do that, then it would be fine for the feature
to work, right?

Alternatively, if C would pass a new option called for example
--known-promisor=X on top of all other options, then that could be Ok
too?
Christian Couder <christian.couder@gmail.com> writes:

> Alternatively, if C would pass a new option called for example
> --known-promisor=X on top of all other options, then that could be Ok
> too?

Why do you need a new option "--known-promisor=X" in the first place?
Isn't it known by the repository already, so that you do not have to
bother the end user to supply it from the command line?

In any case, a protocol extension that lets S tell C that S wants C
to fetch missing objects from X (which means that if C knows about X
in its ".git/config" then there is no need for end-user interaction
at all), or a protocol extension with which C tells S that C is
willing to see objects available from X omitted when S does not have
them (again, this could be done by looking at ".git/config" at C, but
there may be security implications???). Either would work OK and I
offhand do not see much preference between the two. What is important
is to ensure that such deliberate omission by S is safe and
acceptable by C.

Thanks.

[Footnote]

By the way, I briefly wondered: would "X" that is valid for "C"
always be valid for "S" these days? I as "S" may prefer you as "C" to
go to ssh://gitlab.com/x.git but you may only be able to go to
https://gitlab.com/x.git for networking reasons. Silly situations
like that may make communicating "X" between "S" and "C" a bit harder
than we would like.

Of course this is not a high priority issue as users with such
networking needs are probably already using url.X.insteadOf to
rewrite the submodule paths recommended in .gitmodules after they
clone a superproject; iow, even if it were a problem, it is a solved
one.

So, in short, yes, exchanging the known promisor "X" between "S" and
"C", regardless of which side speaks first, as long as the
negotiation makes sure both sides agree, would be a good direction to
go, I would think.
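[Editorial note: the `url.X.insteadOf` mechanism mentioned in the footnote is Git's standard URL rewriting; a minimal sketch with a made-up remote name, using `ls-remote --get-url`, which expands the URL without contacting the remote:]

```shell
# A client that can only reach the promisor over HTTPS maps the
# ssh:// URL the server would suggest onto https://.
git init --quiet demo
git -C demo config url."https://gitlab.com/".insteadOf "ssh://gitlab.com/"
git -C demo remote add x ssh://gitlab.com/x.git
# --get-url expands the remote's URL, applying insteadOf rewriting,
# and exits without talking to the remote.
git -C demo ls-remote --get-url x
```

This is the same rewriting the footnote says users already apply to `.gitmodules` URLs, which is why a communicated "X" need not be reachable verbatim by every client.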
diff --git a/Documentation/config/uploadpack.txt b/Documentation/config/uploadpack.txt
index 16264d82a7..cd70e853b3 100644
--- a/Documentation/config/uploadpack.txt
+++ b/Documentation/config/uploadpack.txt
@@ -82,3 +82,12 @@ uploadpack.allowRefInWant::
 	is intended for the benefit of load-balanced servers which may
 	not have the same view of what OIDs their refs point to due to
 	replication delay.
+
+uploadpack.missingAction::
+	If this option is set, `upload-pack` will call
+	linkgit:git-pack-objects[1] passing it the corresponding
+	`--missing=<missing-action>` command line option. See the
+	documentation for that option, to see the valid values and
+	their meaning. This could be useful if some objects are
+	missing on a server, but a client already has them or could
+	still get them from somewhere else.
diff --git a/missing.c b/missing.c
index ce3cf734a8..75036d2089 100644
--- a/missing.c
+++ b/missing.c
@@ -18,3 +18,19 @@ int parse_missing_action_value(const char *value)
 
 	return -1;
 }
+
+const char *missing_action_to_string(enum missing_action action)
+{
+	switch (action) {
+	case MA_ERROR:
+		return "error";
+	case MA_ALLOW_ANY:
+		return "allow-any";
+	case MA_PRINT:
+		return "print";
+	case MA_ALLOW_PROMISOR:
+		return "allow-promisor";
+	default:
+		BUG("invalid missing action %d", action);
+	}
+}
diff --git a/missing.h b/missing.h
index 1e378d6215..74753d7887 100644
--- a/missing.h
+++ b/missing.h
@@ -14,4 +14,6 @@ enum missing_action {
  */
 int parse_missing_action_value(const char *value);
 
+const char *missing_action_to_string(enum missing_action action);
+
 #endif /* MISSING_H */
diff --git a/t/t5706-upload-pack-missing.sh b/t/t5706-upload-pack-missing.sh
new file mode 100755
index 0000000000..0f39c0ddd0
--- /dev/null
+++ b/t/t5706-upload-pack-missing.sh
@@ -0,0 +1,125 @@
+#!/bin/sh
+
+test_description='handling of missing objects in upload-pack'
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+# Setup the repository with three commits, this way HEAD is always
+# available and we can hide commit 1 or 2.
+test_expect_success 'setup: create "template" repository' '
+	git init template &&
+	test_commit -C template 1 &&
+	test_commit -C template 2 &&
+	test_commit -C template 3 &&
+	test-tool genrandom foo 10240 >template/foo &&
+	git -C template add foo &&
+	git -C template commit -m foo
+'
+
+# A bare repo will act as a server repo with unpacked objects.
+test_expect_success 'setup: create bare "server" repository' '
+	git clone --bare --no-local template server &&
+	mv server/objects/pack/pack-* . &&
+	packfile=$(ls pack-*.pack) &&
+	git -C server unpack-objects --strict <"$packfile"
+'
+
+# Fetching with 'uploadpack.missingAction=allow-any' only works with
+# blobs, as `git pack-objects --missing=allow-any` fails if a missing
+# object is not a blob.
+test_expect_success "fetch with uploadpack.missingAction=allow-any" '
+	oid="$(git -C server rev-parse HEAD:1.t)" &&
+	oid_path="$(test_oid_to_path $oid)" &&
+	path="server/objects/$oid_path" &&
+
+	mv "$path" "$path.hidden" &&
+	test_when_finished "mv $path.hidden $path" &&
+
+	git init client &&
+	test_when_finished "rm -rf client" &&
+
+	# Client needs the missing objects to be available somehow
+	client_path="client/.git/objects/$oid_path" &&
+	mkdir -p $(dirname "$client_path") &&
+	cp "$path.hidden" "$client_path" &&
+
+	test_must_fail git -C client fetch ../server &&
+	git -C server config uploadpack.missingAction error &&
+	test_must_fail git -C client fetch ../server &&
+	git -C server config uploadpack.missingAction allow-any &&
+	git -C client fetch ../server &&
+	git -C server config --unset uploadpack.missingAction
+'
+
+check_missing_objects () {
+	git -C "$1" rev-list --objects --all --missing=print > all.txt &&
+	sed -n "s/^\?\(.*\)/\1/p" <all.txt >missing.txt &&
+	test_line_count = "$2" missing.txt &&
+	test "$3" = "$(cat missing.txt)"
+}
+
+test_expect_success "setup for testing uploadpack.missingAction=allow-promisor" '
+	# Create another bare repo called "server2"
+	git init --bare server2 &&
+
+	# Copy the largest object from server to server2
+	obj="HEAD:foo" &&
+	oid="$(git -C server rev-parse $obj)" &&
+	oid_path="$(test_oid_to_path $oid)" &&
+	path="server/objects/$oid_path" &&
+	path2="server2/objects/$oid_path" &&
+	mkdir -p $(dirname "$path2") &&
+	cp "$path" "$path2" &&
+
+	# Repack everything first
+	git -C server -c repack.writebitmaps=false repack -a -d &&
+
+	# Repack without the largest object and create a promisor pack on server
+	git -C server -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:limit=5k --filter-to="$(pwd)" &&
+	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
+	> "$promisor_file" &&
+
+	# Check that only one object is missing on the server
+	check_missing_objects server 1 "$oid" &&
+
+	# Configure server2 as promisor remote for server
+	git -C server remote add server2 "file://$(pwd)/server2" &&
+	git -C server config remote.server2.promisor true &&
+
+	git -C server2 config uploadpack.allowFilter true &&
+	git -C server2 config uploadpack.allowAnySHA1InWant true &&
+	git -C server config uploadpack.allowFilter true &&
+	git -C server config uploadpack.allowAnySHA1InWant true
+'
+
+test_expect_success "fetch with uploadpack.missingAction=allow-promisor" '
+	git -C server config uploadpack.missingAction allow-promisor &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is still missing on the server
+	check_missing_objects server 1 "$oid"
'
+
+test_expect_success "fetch without uploadpack.missingAction=allow-promisor" '
+	git -C server config --unset uploadpack.missingAction &&
+
+	# Clone from server to create a client
+	GIT_NO_LAZY_FETCH=0 git clone -c remote.server2.promisor=true \
+		-c remote.server2.fetch="+refs/heads/*:refs/remotes/server2/*" \
+		-c remote.server2.url="file://$(pwd)/server2" \
+		--no-local --filter="blob:limit=5k" server client &&
+	test_when_finished "rm -rf client" &&
+
+	# Check that the largest object is not missing on the server anymore
+	check_missing_objects server 0 ""
+'
+
+test_done
diff --git a/upload-pack.c b/upload-pack.c
index 902144b9d3..f114f92197 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -29,6 +29,8 @@
 #include "write-or-die.h"
 #include "json-writer.h"
 #include "strmap.h"
+#include "missing.h"
+#include "object-file.h"
 
 /* Remember to update object flag allocation in object.h */
 #define THEY_HAVE	(1u << 11)
@@ -96,6 +98,8 @@ struct upload_pack_data {
 
 	const char *pack_objects_hook;
 
+	enum missing_action missing_action;
+
 	unsigned stateless_rpc : 1;	/* v0 only */
 	unsigned no_done : 1;		/* v0 only */
 	unsigned daemon_mode : 1;	/* v0 only */
@@ -315,6 +319,9 @@ static void create_pack_file(struct upload_pack_data *pack_data,
 		strvec_push(&pack_objects.args, "--delta-base-offset");
 	if (pack_data->use_include_tag)
 		strvec_push(&pack_objects.args, "--include-tag");
+	if (pack_data->missing_action)
+		strvec_pushf(&pack_objects.args, "--missing=%s",
+			     missing_action_to_string(pack_data->missing_action));
 	if (pack_data->filter_options.choice) {
 		const char *spec =
 			expand_list_objects_filter_spec(&pack_data->filter_options);
@@ -1371,6 +1378,18 @@ static int upload_pack_config(const char *var, const char *value,
 		precomposed_unicode = git_config_bool(var, value);
 	} else if (!strcmp("transfer.advertisesid", var)) {
 		data->advertise_sid = git_config_bool(var, value);
+	} else if (!strcmp("uploadpack.missingaction", var)) {
+		int res = parse_missing_action_value(value);
+		if (res < 0 || (res != MA_ERROR &&
+				res != MA_ALLOW_ANY &&
+				res != MA_ALLOW_PROMISOR))
+			die(_("invalid value for '%s': '%s'"), var, value);
+		/* Allow fetching only from promisor remotes */
+		if (res == MA_ALLOW_PROMISOR)
+			fetch_if_missing = 1;
+		if (res == MA_ALLOW_ANY)
+			fetch_if_missing = 0;
+		data->missing_action = res;
 	}
 
 	if (parse_object_filter_config(var, value, ctx->kvi, data) < 0)