Message ID | pull.1206.v2.git.git.1644372606.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Series | repack: add --filter= |
Hi John, On Wed, 9 Feb 2022 at 02:41, John Cai via GitGitGadget <gitgitgadget@gmail.com> wrote: > > This patch series makes partial clone more useful by making it possible to > run repack to remove objects from a repository (replacing it with promisor > objects). This is useful when we want to offload large blobs from a git > server onto another git server, or even use an http server through a remote > helper. > > In [A], a --refilter option on fetch and fetch-pack is being discussed where > either a less restrictive or more restrictive filter can be used. In the > more restrictive case, the objects that already exist will not be deleted. > But, one can imagine that users might want the ability to delete objects > when they apply a more restrictive filter in order to save space, and this > patch series would also allow that. This all makes sense to me, and the implementation is remarkably short - gluing together capabilities that are already there, and writing tests. *But*, running `repack --filter` drops objects from the object db. That seems like a capability Git shouldn't idly expose without people understanding the consequences - mostly that they really have another copy elsewhere or they will lose data, and it won't necessarily be obvious for a long time. Otherwise it is a footgun. I don't know whether that is just around naming (--delete-filter / --drop-filter / --expire-filter ?), and/or making the documentation very explicit that this isn't so much "omitting certain objects from a packfile" as irretrievably deleting objects. Rob :)
Hi Rob, glad these two efforts dovetail nicely! On 16 Feb 2022, at 10:39, Robert Coup wrote: > Hi John, > > On Wed, 9 Feb 2022 at 02:41, John Cai via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> >> This patch series makes partial clone more useful by making it possible to >> run repack to remove objects from a repository (replacing it with promisor >> objects). This is useful when we want to offload large blobs from a git >> server onto another git server, or even use an http server through a remote >> helper. >> >> In [A], a --refilter option on fetch and fetch-pack is being discussed where >> either a less restrictive or more restrictive filter can be used. In the >> more restrictive case, the objects that already exist will not be deleted. >> But, one can imagine that users might want the ability to delete objects >> when they apply a more restrictive filter in order to save space, and this >> patch series would also allow that. > > This all makes sense to me, and the implementation is remarkably short - > gluing together capabilities that are already there, and writing tests. > > *But*, running `repack --filter` drops objects from the object db. > That seems like > a capability Git shouldn't idly expose without people understanding the > consequences - mostly that they really have another copy elsewhere or they > will lose data, and it won't necessarily be obvious for a long time. Otherwise > it is a footgun. Yes, great point. I think there was concern from Stolee around this as well. > > I don't know whether that is just around naming (--delete-filter / > --drop-filter / > --expire-filter ?), and/or making the documentation very explicit that > this isn't so > much "omitting certain objects from a packfile" as irretrievably > deleting objects. Yeah, making the name very clear (I kind of like --delete-filter) would certainly help. Also, to have more protection we can either 1. add a config value that needs to be set to true for repack to remove objects (repack.allowDestroyFilter). 2. --filter is dry-run by default and prints out objects that would have been removed, and it has to be combined with another flag --destroy in order for it to actually remove objects from the odb. > > Rob :)
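[A minimal sketch of how the two safeguards proposed above might look from the command line. None of these knobs exist in released Git: `repack --filter`, `repack.allowDestroyFilter` and `--destroy` are all proposals from this thread, shown only to illustrate the suggested UI.]

    # (1) destructive filtering only honored after an explicit opt-in
    $ git config repack.allowDestroyFilter true
    $ git repack -a -d --filter=blob:limit=1m

    # (2) dry-run by default: list what would be dropped, delete nothing
    $ git repack -a -d --filter=blob:limit=1m
    # ... prints the object IDs that would be removed ...

    # only the extra flag actually removes objects from the odb
    $ git repack -a -d --filter=blob:limit=1m --destroy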
On Wed, Feb 16, 2022 at 04:07:14PM -0500, John Cai wrote:

> > I don't know whether that is just around naming (--delete-filter /
> > --drop-filter / --expire-filter ?), and/or making the documentation
> > very explicit that this isn't so much "omitting certain objects from
> > a packfile" as irretrievably deleting objects.
>
> Yeah, making the name very clear (I kind of like --delete-filter) would certainly help.
> Also, to have more protection we can either
>
> 1. add a config value that needs to be set to true for repack to remove
> objects (repack.allowDestroyFilter).
>
> 2. --filter is dry-run by default and prints out objects that would have been removed,
> and it has to be combined with another flag --destroy in order for it to actually remove
> objects from the odb.

I share the same concern as Robert and Stolee do. But I think this issue goes deeper than just naming.

Even if we called this `git repack --delete-filter` and only ran it with an `--i-know-what-im-doing` flag, we would still be leaving repository corruption on the table, just making it marginally more difficult to achieve.

I'm not familiar enough with the proposal to comment authoritatively, but it seems like we should be verifying that there is a promisor remote which promises any objects that we are about to filter out of the repository. I think that this is basically what `pack-objects`'s `--missing=allow-promisor` does, though I don't think that's the right tool for this job, either. Because pack-objects also knows the object filter, by the time we are ready to construct a pack, we're traversing the filtered list of objects. So we don't even bother to call show_object (or, in this case, builtin/pack-objects.c::show_object__ma_allow_promisor) on them.

So I wonder what your thoughts are on having pack-objects only allow an object to get "filtered out" if a copy of it is promised by some promisor remote.

Alternatively, a more straightforward option might be to have `git repack` look at any objects that exist in a pack we're about to delete, but don't exist in any of the packs we are going to leave around, and make sure that any of those objects are either unreachable or exist on a promisor remote.

But as it stands right now, I worry that this feature is too easily misused and could result in unintended repository corruption. I think verifying that any objects we're about to delete exist somewhere should make this safer to use, though even then, I think we're still open to a TOCTOU race whereby the promisor has the objects when we're about to delete them (convincing Git that deleting those objects is OK to do) but gets rid of them after objects have been deleted from the local copy (leaving no copies of the object around).

So, I don't know exactly what the right path forward is. But I'm curious to get your thoughts on the above.

Thanks,
Taylor
Hi, On Mon, 21 Feb 2022 at 03:11, Taylor Blau <me@ttaylorr.com> wrote: > > we would still be leaving repository > corruption on the table, just making it marginally more difficult to > achieve. While reviewing John's patch I initially wondered if a better approach might be something like `git repack -a -d --exclude-stdin`, passing a list of specific objects to exclude from the new pack (sourced from rev-list via a filter, etc). To me this seems like a less dangerous approach, but my concern was it doesn't use the existing filter capabilities of pack-objects, and we end up generating and passing around a huge list of oids. And of course any mistakes in the list generation aren't visible until it's too late. I also wonder whether there's a race condition if the repository gets updated? If you're moving large objects out in advance, then filtering the remainder there's nothing to stop a new large object being pushed between those two steps and getting dropped. My other idea, which is growing on me, is whether repack could generate two valid packs: one for the included objects via the filter (as John's change does now), and one containing the filtered-out objects. `git repack -a -d --split-filter=<filter>` Then a user could then move/extract the second packfile to object storage, but there'd be no way to *accidentally* corrupt the repository by using a bad option. With this approach the race condition above goes away too. $ git repack -a -d -q --split-filter=blob:limit=1m pack-37b7443e3123549a2ddee31f616ae272c51cae90 pack-10789d94fcd99ffe1403b63b167c181a9df493cd First pack identifier being the objects that match the filter (ie: commits/trees/blobs <1m), and the second pack identifier being the objects that are excluded by the filter (blobs >1m). An astute --i-know-what-im-doing reader could spot that you could just delete the second packfile and achieve the same outcome as the current proposed patch, subject to being confident the race condition hadn't happened to you. Thanks, Rob :)
On Mon, Feb 21, 2022 at 03:38:03PM +0000, Robert Coup wrote: > Hi, > > On Mon, 21 Feb 2022 at 03:11, Taylor Blau <me@ttaylorr.com> wrote: > > > > we would still be leaving repository > > corruption on the table, just making it marginally more difficult to > > achieve. > > While reviewing John's patch I initially wondered if a better approach > might be something like `git repack -a -d --exclude-stdin`, passing a > list of specific objects to exclude from the new pack (sourced from > rev-list via a filter, etc). To me this seems like a less dangerous > approach, but my concern was it doesn't use the existing filter > capabilities of pack-objects, and we end up generating and passing > around a huge list of oids. And of course any mistakes in the list > generation aren't visible until it's too late. Yeah; I think the most elegant approach would have pack-objects do as much work as possible, and have repack be in charge of coordinating what all the pack-objects invocation(s) have to do. > I also wonder whether there's a race condition if the repository gets > updated? If you're moving large objects out in advance, then filtering > the remainder there's nothing to stop a new large object being pushed > between those two steps and getting dropped. Yeah, we will want to make sure that we're operating on a consistent view of the repository. If this is all done in-process, it won't be a problem since we'll capture an atomic snapshot of the reference states once. If this is done across multiple processes, we'll need to make sure we're passing around that snapshot where appropriate. See the `--refs-snapshot`-related code in git-repack for when we write a multi-pack bitmap for an example of the latter. > My other idea, which is growing on me, is whether repack could > generate two valid packs: one for the included objects via the filter > (as John's change does now), and one containing the filtered-out > objects. `git repack -a -d --split-filter=<filter>` Then a user could > then move/extract the second packfile to object storage, but there'd > be no way to *accidentally* corrupt the repository by using a bad > option. With this approach the race condition above goes away too. > > $ git repack -a -d -q --split-filter=blob:limit=1m > pack-37b7443e3123549a2ddee31f616ae272c51cae90 > pack-10789d94fcd99ffe1403b63b167c181a9df493cd > > First pack identifier being the objects that match the filter (ie: > commits/trees/blobs <1m), and the second pack identifier being the > objects that are excluded by the filter (blobs >1m). I like this idea quite a bit. We also have a lot of existing tools that would make an implementation fairly lightweight, namely pack-objects' `--stdin-packs` mode. Using that would look something like first having `repack` generate the filtered pack first, remembering its name [1]. After that, we would run `pack-objects` again, this time with `--stdin-packs`, where the positive packs are the ones we're going to delete, and the negative pack(s) is/are the filtered one generated in the last step. The second invocation would leave us with a single pack which represents all of the objects in packs we are about to delete, skipping any objects that are in the filtered pack we just generated. In other words, it would leave the repository with two packs: one with all of the objects that met the filter criteria, and one with all objects that don't meet the filter criteria. A client could then upload the "doesn't meet the filter criteria" pack off elsewhere, and then delete it locally. 
(I'm assuming this last part in particular is orchestrated by some other command, and we aren't encouraging users to run "rm" inside of .git/objects/pack!) > An astute --i-know-what-im-doing reader could spot that you could just > delete the second packfile and achieve the same outcome as the current > proposed patch, subject to being confident the race condition hadn't > happened to you. Yeah, and I think this goes to my joking remark in the last paragraph. If we allow users to delete packs at will, all bets are off regardless of how safely we generate those packs. But I think splitting the repository into two packs and _then_ dealing with one of them separately as opposed to deleting objects which don't meet the filter criteria immediately is moving in a safer direction. > Thanks, > Rob :) Thanks, Taylor [1]: Optionally "name_s_", if we passed the `--max-pack-size` option to `git pack-objects`, which we can trigger via a `git repack` option of the same name.
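[A rough sketch of the two `pack-objects` invocations described above. It assumes the series' patch that allows `--filter` without `--stdout`; the real `git repack` would orchestrate this internally and clean up the old packs itself, so this is only an illustration of the idea, not the actual implementation.]

    # remember which packs exist before we start
    $ ls .git/objects/pack/pack-*.pack | xargs -n1 basename >old-packs.txt

    # step 1: a pack holding only the objects that meet the filter;
    # pack-objects prints the hash of the pack it writes
    $ keep=$(git pack-objects --all --filter=blob:limit=1m \
            .git/objects/pack/pack </dev/null)

    # step 2: a pack holding everything from the old packs that is *not*
    # already in the filtered pack ("^" marks an excluded pack)
    $ { cat old-packs.txt; echo "^pack-$keep.pack"; } |
            git pack-objects --stdin-packs .git/objects/pack/pack

    # the old packs could then be dropped, leaving exactly two packs:
    # one matching the filter, and one with everything else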
On Mon, Feb 21, 2022 at 4:11 AM Taylor Blau <me@ttaylorr.com> wrote: > > On Wed, Feb 16, 2022 at 04:07:14PM -0500, John Cai wrote: > > > I don't know whether that is just around naming (--delete-filter / > > > --drop-filter / > > > --expire-filter ?), and/or making the documentation very explicit that > > > this isn't so > > > much "omitting certain objects from a packfile" as irretrievably > > > deleting objects. > > > > Yeah, making the name very clear (I kind of like --delete-filter) would certainly help. I am ok with making the name and doc very clear that it deletes objects and they might be lost if they haven't been saved elsewhere first. > > Also, to have more protection we can either > > > > 1. add a config value that needs to be set to true for repack to remove > > objects (repack.allowDestroyFilter). I don't think it's of much value. We don't have such config values for other possibly destructive operations. > > 2. --filter is dry-run by default and prints out objects that would have been removed, > > and it has to be combined with another flag --destroy in order for it to actually remove > > objects from the odb. I am not sure it's of much value either compared to naming it --filter-destroy. It's likely to just make things more difficult for users to understand. > I share the same concern as Robert and Stolee do. But I think this issue > goes deeper than just naming. > > Even if we called this `git repack --delete-filter` and only ran it with > `--i-know-what-im-doing` flag, we would still be leaving repository > corruption on the table, just making it marginally more difficult to > achieve. My opinion on this is that the promisor object mechanism assumes by design that some objects are outside a repo, and that this repo shouldn't care much about these objects possibly being corrupted. It's the same for git LFS. As only a pointer file is stored in Git and the real file is stored elsewhere, the Git repo doesn't care by design about possible corruption of the real file. I am not against a name and some docs that strongly state that users should be very careful when using such a command, but otherwise I think such a command is perfectly ok. We have other commands that by design could lead to some objects or data being lost. > I'm not familiar enough with the proposal to comment authoritatively, > but it seems like we should be verifying that there is a promisor remote > which promises any objects that we are about to filter out of the > repository. I think it could be a follow up mode that could be useful and safe, but there should be no requirement for such a mode. In some cases you know very much what you want and you don't want checks. For example if you have taken proper care to transfer large objects to another remote, you might just not need other possibly expansive checks. [...] > But as it stands right now, I worry that this feature is too easily > misused and could result in unintended repository corruption. Are you worrying about the UI or about what it does? I am ok with improving the UI, but I think what it does is reasonable. > I think verifying that that any objects we're about to delete exist > somewhere should make this safer to use, though even then, I think we're > still open to a TOCTOU race whereby the promisor has the objects when > we're about to delete them (convincing Git that deleting those objects > is OK to do) but gets rid of them after objects have been deleted from > the local copy (leaving no copies of the object around). 
Possible TOCTOU races are a good reason why something with no check is perhaps a better goal for now.
On Mon, Feb 21, 2022 at 10:10:15PM +0100, Christian Couder wrote: > > > Also, to have more protection we can either > > > > > > 1. add a config value that needs to be set to true for repack to remove > > > objects (repack.allowDestroyFilter). > > I don't think it's of much value. We don't have such config values for > other possibly destructive operations. > > > > 2. --filter is dry-run by default and prints out objects that would have been removed, > > > and it has to be combined with another flag --destroy in order for it to actually remove > > > objects from the odb. > > I am not sure it's of much value either compared to naming it > --filter-destroy. It's likely to just make things more difficult for > users to understand. On this and the above, I agree with Christian. > > I share the same concern as Robert and Stolee do. But I think this issue > > goes deeper than just naming. > > > > Even if we called this `git repack --delete-filter` and only ran it with > > `--i-know-what-im-doing` flag, we would still be leaving repository > > corruption on the table, just making it marginally more difficult to > > achieve. > > My opinion on this is that the promisor object mechanism assumes by > design that some objects are outside a repo, and that this repo > shouldn't care much about these objects possibly being corrupted. For what it's worth, I am fine having a mode of repack which allows us to remove objects that we know are stored by a promisor remote. But this series doesn't do that, so users could easily run `git repack -d --filter=...` and find that they have irrecoverably corrupted their repository. I think that there are some other reasonable directions, though. One which Robert and I discussed was making it possible to split a repository into two packs, one which holds objects that match some `--filter` criteria, and one which holds the objects that don't match that filter. Another option would be to prune the repository according to objects that are already made available by a promisor remote. An appealing quality about the above two directions is that the first doesn't actually remove any objects, just makes it easier to push a whole pack of unwanted objects off to a promsior remote. The second prunes the repository according to objects that are already made available by the promisor remote. (Yes, there is a TOCTOU race there, too, but it's the same prune-while-pushing race that Git already has today). > I am not against a name and some docs that strongly state that users > should be very careful when using such a command, but otherwise I > think such a command is perfectly ok. We have other commands that by > design could lead to some objects or data being lost. I can think of a handful of ways to remove objects which are unreachable from a repository, but I am not sure we have any ways to remove objects which are reachable. > > But as it stands right now, I worry that this feature is too easily > > misused and could result in unintended repository corruption. > > Are you worrying about the UI or about what it does? > > I am ok with improving the UI, but I think what it does is reasonable. I am more worried about the proposal's functionality than its UI, hopefully my concerns there are summarized above. Thanks, Taylor
On Mon, Feb 21, 2022 at 10:42 PM Taylor Blau <me@ttaylorr.com> wrote: > > On Mon, Feb 21, 2022 at 10:10:15PM +0100, Christian Couder wrote: > > > > Also, to have more protection we can either > > > > > > > > 1. add a config value that needs to be set to true for repack to remove > > > > objects (repack.allowDestroyFilter). > > > > I don't think it's of much value. We don't have such config values for > > other possibly destructive operations. > > > > > > 2. --filter is dry-run by default and prints out objects that would have been removed, > > > > and it has to be combined with another flag --destroy in order for it to actually remove > > > > objects from the odb. > > > > I am not sure it's of much value either compared to naming it > > --filter-destroy. It's likely to just make things more difficult for > > users to understand. > > On this and the above, I agree with Christian. > > > > I share the same concern as Robert and Stolee do. But I think this issue > > > goes deeper than just naming. > > > > > > Even if we called this `git repack --delete-filter` and only ran it with > > > `--i-know-what-im-doing` flag, we would still be leaving repository > > > corruption on the table, just making it marginally more difficult to > > > achieve. > > > > My opinion on this is that the promisor object mechanism assumes by > > design that some objects are outside a repo, and that this repo > > shouldn't care much about these objects possibly being corrupted. > > For what it's worth, I am fine having a mode of repack which allows us > to remove objects that we know are stored by a promisor remote. But this > series doesn't do that, so users could easily run `git repack -d > --filter=...` and find that they have irrecoverably corrupted their > repository. In some cases we just know the objects we are removing are stored by a promisor remote or are replicated on different physical machines or both, so you should be fine with this. If you are not fine with this because sometimes a user might use it without knowing, then why are you ok with commands deleting refs not checking that there isn't a regular repack removing dangling objects? Also note that people who want to remove objects using a filter can already do it by cloning with a filter and then replacing the original packs with the packs from the clone. So refusing this new feature is just making things more cumbersome. > I think that there are some other reasonable directions, though. One > which Robert and I discussed was making it possible to split a > repository into two packs, one which holds objects that match some > `--filter` criteria, and one which holds the objects that don't match > that filter. I am ok with someone implementing this feature, but if an option that actually deletes the filtered objects is rejected then such a feature will be used with some people just deleting one of the resulting packs (and they might get it wrong), so I don't think any real safety will be gained. > Another option would be to prune the repository according to objects > that are already made available by a promisor remote. If the objects have just been properly transferred to the promisor remote, the check will just waste resources. > An appealing quality about the above two directions is that the first > doesn't actually remove any objects, just makes it easier to push a > whole pack of unwanted objects off to a promsior remote. The second > prunes the repository according to objects that are already made > available by the promisor remote. 
(Yes, there is a TOCTOU race there, > too, but it's the same prune-while-pushing race that Git already has > today). > > > I am not against a name and some docs that strongly state that users > > should be very careful when using such a command, but otherwise I > > think such a command is perfectly ok. We have other commands that by > > design could lead to some objects or data being lost. > > I can think of a handful of ways to remove objects which are unreachable > from a repository, but I am not sure we have any ways to remove objects > which are reachable. Cloning with a filter already does that. It's by design in the promisor object and partial clone mechanisms that reachable objects are removed. Having more than one promisor remote, which is an existing mechanism, means that it's just wasteful to require all the remotes to have all the reachable objects, so how could people easily set up such remotes? Why make it unnecessarily hard and forbid a straightforward way?
Hi Christian, I think my objections may be based on a misunderstanding of John and your original proposal. From reading [1], it seemed to me like a required step of this proposal was to upload the objects you want to filter out ahead of time, and then run `git repack -ad --filter=...`. So my concerns thus far have been around the lack of cohesion between (1) the filter which describes the set of objects uploaded to the HTTP server, and (2) the filter used when re-filtering the repository. If (1) and (2) aren't inverses of each other, then in the case where (2) leaves behind an object which wasn't caught by (1), we have lost that object. If instead the server used in your script at [1] is a stand-in for an ordinary Git remote, that changes my thinking significantly. See below for more details: On Tue, Feb 22, 2022 at 06:11:11PM +0100, Christian Couder wrote: > > > > I share the same concern as Robert and Stolee do. But I think this issue > > > > goes deeper than just naming. > > > > > > > > Even if we called this `git repack --delete-filter` and only ran it with > > > > `--i-know-what-im-doing` flag, we would still be leaving repository > > > > corruption on the table, just making it marginally more difficult to > > > > achieve. > > > > > > My opinion on this is that the promisor object mechanism assumes by > > > design that some objects are outside a repo, and that this repo > > > shouldn't care much about these objects possibly being corrupted. > > > > For what it's worth, I am fine having a mode of repack which allows us > > to remove objects that we know are stored by a promisor remote. But this > > series doesn't do that, so users could easily run `git repack -d > > --filter=...` and find that they have irrecoverably corrupted their > > repository. > > In some cases we just know the objects we are removing are stored by a > promisor remote or are replicated on different physical machines or > both, so you should be fine with this. I am definitely OK with a convenient way to re-filter your repository locally so long as you know that the objects you are filtering out are available via some promisor remote. But perhaps I have misunderstood what this proposal is for. Reading through John's original cover letter and the link to your demo script, I understood that a key part of this was being able to upload the pack of objects you were about to filter out of your local copy to some server (not necessarily Git) over HTTP. My hesitation so far has been based on that understanding. Reading these patches, I don't see a mechanism to upload objects we're about to expunge to a promisor remote. But perhaps I'm misunderstanding: if you are instead assuming that the existing set of remotes can serve any objects that we deleted, and this is the way to delete them, then I am OK with that approach. But I think either way, I am missing some details in the original proposal that would have perhaps made it easier for me to understand what your goals are. In any case, this patch series doesn't seem to correctly set up a promisor remote for me, since doing the following on a fresh clone of git.git (after running "make"): $ bin-wrappers/git repack -ad --filter=blob:none $ bin-wrappers/git fsck results in many "missing blob" and "missing link" lines out of output. (FWIW, I think what's missing here is correctly setting up the affected remote(s) as promisors and indicating that we're now in a filtered setting when going from a full clone down to a partial one.) 
> If you are not fine with this because sometimes a user might use it > without knowing, then why are you ok with commands deleting refs not > checking that there isn't a regular repack removing dangling objects? I'm not sure I totally understand your question, but my general sense has been "because we typically make it difficult / impossible to remove reachable objects". Thanks, Taylor [1]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52
Hi Taylor, On 21 Feb 2022, at 16:42, Taylor Blau wrote: > On Mon, Feb 21, 2022 at 10:10:15PM +0100, Christian Couder wrote: >>>> Also, to have more protection we can either >>>> >>>> 1. add a config value that needs to be set to true for repack to remove >>>> objects (repack.allowDestroyFilter). >> >> I don't think it's of much value. We don't have such config values for >> other possibly destructive operations. >> >>>> 2. --filter is dry-run by default and prints out objects that would have been removed, >>>> and it has to be combined with another flag --destroy in order for it to actually remove >>>> objects from the odb. >> >> I am not sure it's of much value either compared to naming it >> --filter-destroy. It's likely to just make things more difficult for >> users to understand. > > On this and the above, I agree with Christian. > >>> I share the same concern as Robert and Stolee do. But I think this issue >>> goes deeper than just naming. >>> >>> Even if we called this `git repack --delete-filter` and only ran it with >>> `--i-know-what-im-doing` flag, we would still be leaving repository >>> corruption on the table, just making it marginally more difficult to >>> achieve. >> >> My opinion on this is that the promisor object mechanism assumes by >> design that some objects are outside a repo, and that this repo >> shouldn't care much about these objects possibly being corrupted. > > For what it's worth, I am fine having a mode of repack which allows us > to remove objects that we know are stored by a promisor remote. But this > series doesn't do that, so users could easily run `git repack -d > --filter=...` and find that they have irrecoverably corrupted their > repository. > > I think that there are some other reasonable directions, though. One > which Robert and I discussed was making it possible to split a > repository into two packs, one which holds objects that match some > `--filter` criteria, and one which holds the objects that don't match > that filter. > > Another option would be to prune the repository according to objects > that are already made available by a promisor remote. Thanks for the discussion around the two packfile idea. Definitely interesting. However, I'm leaning towards the second option here where we ensure that objects that are about to be deleted can be retrieved via a promisor remote. That way we have an easy path to recovery. > > An appealing quality about the above two directions is that the first > doesn't actually remove any objects, just makes it easier to push a > whole pack of unwanted objects off to a promsior remote. The second > prunes the repository according to objects that are already made > available by the promisor remote. (Yes, there is a TOCTOU race there, > too, but it's the same prune-while-pushing race that Git already has > today). > >> I am not against a name and some docs that strongly state that users >> should be very careful when using such a command, but otherwise I >> think such a command is perfectly ok. We have other commands that by >> design could lead to some objects or data being lost. > > I can think of a handful of ways to remove objects which are unreachable > from a repository, but I am not sure we have any ways to remove objects > which are reachable. > >>> But as it stands right now, I worry that this feature is too easily >>> misused and could result in unintended repository corruption. >> >> Are you worrying about the UI or about what it does? 
>> >> I am ok with improving the UI, but I think what it does is reasonable. > > I am more worried about the proposal's functionality than its UI, > hopefully my concerns there are summarized above. > > Thanks, > Taylor
On Tue, Feb 22, 2022 at 01:52:09PM -0500, John Cai wrote:

> > Another option would be to prune the repository according to objects
> > that are already made available by a promisor remote.
>
> Thanks for the discussion around the two packfile idea. Definitely
> interesting. However, I'm leaning towards the second option here
> where we ensure that objects that are about to be deleted can be
> retrieved via a promisor remote. That way we have an easy path to
> recovery.

Yeah, I think this may have all come from a potential misunderstanding I had with the original proposal. More of the details there can be found in [1]. But assuming that this proposal isn't about first offloading some objects to an auxiliary (non-Git) server, then I think refiltering into a single pack makes sense, because we trust the remote to still have any objects we deleted.

(The snag I hit was that it seemed like your+Christian's proposal hinged on using _two_ filters, one to produce the set of objects you wanted to get rid of, and another to produce the set of objects you wanted to keep. The lack of cohesion between the two is what gave me pause, but it may not have been what either of you were thinking in the first place).

Anyway, I'm not sure "splitting" a repository along a `--filter` into two packs is all that interesting of an idea, but it is something we could do if it became useful to you without writing too much new code (and instead leveraging `git pack-objects --stdin-packs`).

Thanks,
Taylor

[1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
Hi Christian, On Tue, 22 Feb 2022 at 17:11, Christian Couder <christian.couder@gmail.com> wrote: > > In some cases we just know the objects we are removing are stored by a > promisor remote or are replicated on different physical machines or > both, so you should be fine with this. From my point of view I think the goal here is great. > > Another option would be to prune the repository according to objects > > that are already made available by a promisor remote. > > If the objects have just been properly transferred to the promisor > remote, the check will just waste resources. As far as I can see this patch doesn't know or check that any of the filtered-out objects are held anywhere else... it simply applies a filter during repacking and the excluded objects are dropped. That's the aspect I have concerns about. Maybe an approach where you build/get/maintain a list of objects-I-definitely-have-elsewhere and pass it as an exclude list to repack would be a cleaner/safer/easier solution? If you're confident enough you don't need to check with the promisor remote then you can use a local list, or even something generated with `rev-list --filter=`. Thanks, Rob :)
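[A small sketch of how such an "objects I definitely have elsewhere" list could be generated with existing rev-list options; the series' tests use the same trick. The repack option that would consume the exclude list is the hypothetical part here.]

    # objects that a blob:limit=1m filter would omit, i.e. candidates for
    # "objects I definitely have elsewhere"; omitted entries are prefixed
    # with "~" by --filter-print-omitted
    $ git rev-list --objects --all --filter=blob:limit=1m \
            --filter-print-omitted | sed -n 's/^~//p' >objects-elsewhere.txt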
Christian Couder <christian.couder@gmail.com> writes: >> For what it's worth, I am fine having a mode of repack which allows us >> to remove objects that we know are stored by a promisor remote. But this >> series doesn't do that, so users could easily run `git repack -d >> --filter=...` and find that they have irrecoverably corrupted their >> repository. > > In some cases we just know the objects we are removing are stored by a > promisor remote or are replicated on different physical machines or > both, so you should be fine with this. So, we need to decide if an object we have that is outside the narrowed filter definition was (and still is, but let's keep the assumption the whole lazy clone mechanism makes: promisor remotes will never shed objects that they once served) available at the promisor remote, but I suspect we have too little information to reliably do so. It is OK to assume that objects in existing packs taken from the promisor remotes and everything reachable from them (but missing from our object store) will be available to us from there. But if we see an object that is outside of _new_ filter spec (e.g. you fetched with "max 100MB", now you are refiltering with "max 50MB", narrowing the spec, and you need to decide for an object that weigh 70MB), can we tell if that can be retrieved from the promisor or is it unique to our repository until we push it out? I am not sure. For that matter, do we even have a way to compare if the new filter spec is a subset, a superset, or neither, of the original filter spec? > If you are not fine with this because sometimes a user might use it > without knowing, then why are you ok with commands deleting refs not > checking that there isn't a regular repack removing dangling objects? Sorry, I do not follow this argument. Your user may do "branch -D" because the branch deleted is no longer needed, which may mean that objects only reachable from the deleted branch are no longer needed. I do not see what repack has anything to do with that. As long as the filter spec does not change (in other words, before this series is applied), the repack that discards objects that are known to be reachable from objects in packs retrieved from promisor remote, the objects that are no longer reachable may be removed and that will not lose objects that we do not know to be retrievable from there (which is different from objects that we know are unretrievable). But with filter spec changing after the fact, I am not sure if that is safe. IOW, "commands deleting refs" may have been OK without this series, but this series may be what makes it not OK, no? Puzzled.
Thank you for everyone's feedback. Really appreciate the collaboration!

On 23 Feb 2022, at 14:31, Junio C Hamano wrote:

> Christian Couder <christian.couder@gmail.com> writes:
>
>>> For what it's worth, I am fine having a mode of repack which allows us
>>> to remove objects that we know are stored by a promisor remote. But this
>>> series doesn't do that, so users could easily run `git repack -d
>>> --filter=...` and find that they have irrecoverably corrupted their
>>> repository.
>>
>> In some cases we just know the objects we are removing are stored by a
>> promisor remote or are replicated on different physical machines or
>> both, so you should be fine with this.
>
> So, we need to decide if an object we have that is outside the
> narrowed filter definition was (and still is, but let's keep the
> assumption the whole lazy clone mechanism makes: promisor remotes
> will never shed objects that they once served) available at the
> promisor remote, but I suspect we have too little information to
> reliably do so. It is OK to assume that objects in existing packs
> taken from the promisor remotes and everything reachable from them
> (but missing from our object store) will be available to us from
> there. But if we see an object that is outside of _new_ filter spec
> (e.g. you fetched with "max 100MB", now you are refiltering with
> "max 50MB", narrowing the spec, and you need to decide for an object
> that weigh 70MB), can we tell if that can be retrieved from the
> promisor or is it unique to our repository until we push it out? I
> am not sure. For that matter, do we even have a way to compare if
> the new filter spec is a subset, a superset, or neither, of the
> original filter spec?

Let me try to summarize (perhaps oversimplify) the main concern folks have with this feature, so please correct me if I'm wrong!

As a user, if I apply a filter that ends up deleting objects that it turns out do not exist anywhere else, then I have irrecoverably corrupted my repository. Before git allows me to delete objects from my repository, it should be pretty certain that I have a path to recover those objects if I need to.

Is that correct? It seems to me that, put another way, we don't want to give users too much rope to hang themselves. I can see why we would want to do this.

In this case, there have been a couple of alternative ideas proposed throughout this thread that I think are viable and I wanted to get folks' thoughts.

1. Split pack file (Rob gave this idea and Taylor provided some more detail on how using pack-objects would make it fairly straightforward to implement): when a user wants to apply a filter that removes objects from their repository, split the packfile into one containing objects that are filtered out, and another packfile with objects that remain.
   pros: simple to implement
   cons: does not address the question "how sure am I that the objects I want to filter out of my repository exist on a promisor remote?"

2. Check the promisor remotes to see if they contain the objects that are about to get deleted. Only delete objects that we find on promisor remotes.
   pros: provides assurance that I have access to objects I am about to delete from a promisor remote.
   cons: more complex to implement. [*]

Out of these two, I like 2 more for the aforementioned pros.

* I am beginning to look into how fetches work and am still pretty new to the codebase so I don't know if this is even feasible, but I was thinking perhaps we could write a function that fetches with a --filter and creates a promisor packfile containing promisor objects (this operation would have to somehow ignore the presence of the actual objects in the repository). Then we would have a record of objects we have access to, and repack --filter can remove only the objects contained in this promisor packfile.

>
>> If you are not fine with this because sometimes a user might use it
>> without knowing, then why are you ok with commands deleting refs not
>> checking that there isn't a regular repack removing dangling objects?
>
> Sorry, I do not follow this argument. Your user may do "branch -D"
> because the branch deleted is no longer needed, which may mean that
> objects only reachable from the deleted branch are no longer needed.
> I do not see what repack has anything to do with that. As long as
> the filter spec does not change (in other words, before this series
> is applied), the repack that discards objects that are known to be
> reachable from objects in packs retrieved from promisor remote, the
> objects that are no longer reachable may be removed and that will
> not lose objects that we do not know to be retrievable from there
> (which is different from objects that we know are unretrievable).
> But with filter spec changing after the fact, I am not sure if that
> is safe. IOW, "commands deleting refs" may have been OK without
> this series, but this series may be what makes it not OK, no?
>
> Puzzled.
On Sat, Feb 26, 2022 at 11:01:46AM -0500, John Cai wrote: > let me try to summarize (perhaps over simplify) the main concern folks > have with this feature, so please correct me if I'm wrong! > > As a user, if I apply a filter that ends up deleting objects that it > turns out do not exist anywhere else, then I have irrecoverably > corrupted my repository. > > Before git allows me to delete objects from my repository, it should > be pretty certain that I have path to recover those objects if I need > to. > > Is that correct? It seems to me that, put another way, we don't want > to give users too much rope to hang themselves. I wrote about my concerns in some more detail in [1], but the thing I was most unclear on was how your demo script[2] was supposed to work. Namely, I wasn't sure if you had intended to use two separate filters to "re-filter" a repository, one to filter objects to be uploaded to a content store, and another to filter objects to be expunged from the repository. I have major concerns with that approach, namely that if each of the filters is not exactly the inverse of the other, then we will either upload too few objects, or delete too many. My other concern was around what guarantees we currently provide for a promisor remote. My understanding is that we expect an object which was received from the promisor remote to always be fetch-able later on. If that's the case, then I don't mind the idea of refiltering a repository, provided that you only need to specify a filter once. So the suggestion about splitting a repository into two packs was a potential way to mediate the "two filter" problem, since the two packs you get exactly correspond to the set of objects that match the filter, and the set of objects that _don't_ match the filter. In either case, I tried to use the patches in [1] and was able to corrupt my local repository (even when fetching from a remote that held onto the objects I had pruned locally). Thanks, Taylor [1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/ [2]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52
Hi Taylor,

(resending this because my email client misbehaved and set the mime type to html)

On 26 Feb 2022, at 12:29, Taylor Blau wrote:

> On Sat, Feb 26, 2022 at 11:01:46AM -0500, John Cai wrote:
>> let me try to summarize (perhaps over simplify) the main concern folks
>> have with this feature, so please correct me if I'm wrong!
>>
>> As a user, if I apply a filter that ends up deleting objects that it
>> turns out do not exist anywhere else, then I have irrecoverably
>> corrupted my repository.
>>
>> Before git allows me to delete objects from my repository, it should
>> be pretty certain that I have path to recover those objects if I need
>> to.
>>
>> Is that correct? It seems to me that, put another way, we don't want
>> to give users too much rope to hang themselves.
>
> I wrote about my concerns in some more detail in [1], but the thing I
> was most unclear on was how your demo script[2] was supposed to work.
>
> Namely, I wasn't sure if you had intended to use two separate filters to
> "re-filter" a repository, one to filter objects to be uploaded to a
> content store, and another to filter objects to be expunged from the
> repository. I have major concerns with that approach, namely that if
> each of the filters is not exactly the inverse of the other, then we
> will either upload too few objects, or delete too many.

Thanks for bringing this up again. I meant to write back regarding what you raised in the other part of this thread. I think this is a valid concern. To attain the goal of offloading certain blobs onto another server (B) and saving space on a git server (A), there will essentially be two steps: one to upload objects to (B), and one to remove objects from (A). As you said, these two need to be the inverse of each other or else you might end up with missing objects.

Thinking about it more, there is also an issue of timing. Even if the filters are exact inverses of each other, let's say we have the following order of events:

- (A)'s large blobs get uploaded to (B)
- large blob (C) gets added to (A)
- (A) gets repacked with a filter

In this case we could lose (C) forever. So it does seem like we need some built-in guarantee that we only shed objects from the repo if we know we can retrieve them later on.

> My other concern was around what guarantees we currently provide for a
> promisor remote. My understanding is that we expect an object which was
> received from the promisor remote to always be fetch-able later on. If
> that's the case, then I don't mind the idea of refiltering a repository,
> provided that you only need to specify a filter once.

Could you clarify what you mean by re-filtering a repository? By that I assumed it meant specifying a filter, e.g. 100mb, and then narrowing it by specifying a 50mb filter.

> So the suggestion about splitting a repository into two packs was a
> potential way to mediate the "two filter" problem, since the two packs
> you get exactly correspond to the set of objects that match the filter,
> and the set of objects that _don't_ match the filter.
>
> In either case, I tried to use the patches in [1] and was able to
> corrupt my local repository (even when fetching from a remote that held
> onto the objects I had pruned locally).
>
> Thanks,
> Taylor

Thanks!
John

> [1]: https://lore.kernel.org/git/YhUeUCIetu%2FaOu6k@nand.local/
> [2]: https://gitlab.com/chriscool/partial-clone-demo/-/blob/master/http-promisor/server_demo.txt#L47-52
On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:

> Thanks for bringing this up again. I meant to write back regarding what you raised
> in the other part of this thread. I think this is a valid concern. To attain the
> goal of offloading certain blobs onto another server(B) and saving space on a git
> server(A), then there will essentially be two steps. One to upload objects to (B),
> and one to remove objects from (A). As you said, these two need to be the inverse of each
> other or else you might end up with missing objects.

Do you mean that you want to offload objects both from a local clone of some repository, _and_ the original remote it was cloned from?

I don't understand what the role of "another server" is here. If this proposal was about making it easy to remove objects from a local copy of a repository based on a filter, provided that there was a Git server elsewhere that could act as a promisor remote, then that makes sense to me.

But I think I'm not quite understanding the rest of what you're suggesting.

> > My other concern was around what guarantees we currently provide for a
> > promisor remote. My understanding is that we expect an object which was
> > received from the promisor remote to always be fetch-able later on. If
> > that's the case, then I don't mind the idea of refiltering a repository,
> > provided that you only need to specify a filter once.
>
> Could you clarify what you mean by re-filtering a repository? By that I assumed
> it meant specifying a filter eg: 100mb, and then narrowing it by specifying a
> 50mb filter.

I meant: applying a filter to a local clone (either where there wasn't a filter before, or a filter which matched more objects) and then removing objects that don't match the filter.

But your response makes me think of another potential issue. What happens if I do the following:

    $ git repack -ad --filter=blob:limit=100k
    $ git repack -ad --filter=blob:limit=200k

What should the second invocation do? I would expect that it needs to do a fetch from the promisor remote to recover any blobs between (100, 200] KB in size, since they would be gone after the first repack.

This is a problem not just with two consecutive `git repack --filter`s, I think, since you could cook up the same situation with:

    $ git clone --filter=blob:limit=100k git@github.com:git
    $ git -C git repack -ad --filter=blob:limit=200k

I don't think the existing patches handle this situation, so I'm curious whether it's something you have considered or not before.

(Unrelated to the above, but please feel free to trim any quoted parts of emails when responding if they get overly long.)

Thanks,
Taylor
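[An aside: a sketch of how one could observe the gap Taylor describes using only existing commands; it makes no claim about what the posted patches actually do.]

    # objects the narrower filter left behind as promisor-missing are
    # listed with a "?" prefix instead of causing rev-list to die
    $ git rev-list --objects --all --missing=print | grep '^?'

    # a looser refilter (e.g. blob:limit=200k) would presumably have to
    # fetch those blobs back from the promisor remote before it could
    # include them in the new pack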
Hi Taylor,

On 26 Feb 2022, at 15:30, Taylor Blau wrote:

> On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote:
>> Thanks for bringing this up again. I meant to write back regarding what you raised
>> in the other part of this thread. I think this is a valid concern. To attain the
>> goal of offloading certain blobs onto another server(B) and saving space on a git
>> server(A), then there will essentially be two steps. One to upload objects to (B),
>> and one to remove objects from (A). As you said, these two need to be the inverse of each
>> other or else you might end up with missing objects.
>
> Do you mean that you want to offload objects both from a local clone of
> some repository, _and_ the original remote it was cloned from?

yes, exactly. The "another server" would be something like an http server, OR another remote which hosts a subset of the objects (let's say the large blobs).

> I don't understand what the role of "another server" is here. If this
> proposal was about making it easy to remove objects from a local copy of
> a repository based on a filter provided that there was a Git server
> elsewhere that could act as a promisor remote, than that makes sense to
> me.
>
> But I think I'm not quite understanding the rest of what you're
> suggesting.

Sorry for the lack of clarity here. The goal is to make it easy for a remote to offload a subset of its objects to __another__ remote (either a Git server or an http server through a remote helper).

>>> My other concern was around what guarantees we currently provide for a
>>> promisor remote. My understanding is that we expect an object which was
>>> received from the promisor remote to always be fetch-able later on. If
>>> that's the case, then I don't mind the idea of refiltering a repository,
>>> provided that you only need to specify a filter once.
>>
>> Could you clarify what you mean by re-filtering a repository? By that I assumed
>> it meant specifying a filter eg: 100mb, and then narrowing it by specifying a
>> 50mb filter.
>
> I meant: applying a filter to a local clone (either where there wasn't a
> filter before, or a filter which matched more objects) and then removing
> objects that don't match the filter.
>
> But your response makes me think of another potential issue. What
> happens if I do the following:
>
>     $ git repack -ad --filter=blob:limit=100k
>     $ git repack -ad --filter=blob:limit=200k
>
> What should the second invocation do? I would expect that it needs to do
> a fetch from the promisor remote to recover any blobs between (100, 200]
> KB in size, since they would be gone after the first repack.
>
> This is a problem not just with two consecutive `git repack --filter`s,
> I think, since you could cook up the same situation with:
>
>     $ git clone --filter=blob:limit=100k git@github.com:git
>     $ git -C git repack -ad --filter=blob:limit=200k
>
> I don't think the existing patches handle this situation, so I'm curious
> whether it's something you have considered or not before.

I have not; I will have to think through this case, but this sounds similar to what [1] is about.

> (Unrelated to the above, but please feel free to trim any quoted parts
> of emails when responding if they get overly long.)
>
> Thanks,
> Taylor

Thanks
John

1. https://lore.kernel.org/git/pull.1138.v2.git.1645719218.gitgitgadget@gmail.com/
On Sat, Feb 26, 2022 at 04:05:37PM -0500, John Cai wrote: > Hi Taylor, > > On 26 Feb 2022, at 15:30, Taylor Blau wrote: > > > On Sat, Feb 26, 2022 at 03:19:11PM -0500, John Cai wrote: > >> Thanks for bringing this up again. I meant to write back regarding what you raised > >> in the other part of this thread. I think this is a valid concern. To attain the > >> goal of offloading certain blobs onto another server(B) and saving space on a git > >> server(A), then there will essentially be two steps. One to upload objects to (B), > >> and one to remove objects from (A). As you said, these two need to be the inverse of each > >> other or else you might end up with missing objects. > > > > Do you mean that you want to offload objects both from a local clone of > > some repository, _and_ the original remote it was cloned from? > > yes, exactly. The "another server" would be something like an http server, OR another remote > which hosts a subset of the objects (let's say the large blobs). > > > > I don't understand what the role of "another server" is here. If this > > proposal was about making it easy to remove objects from a local copy of > > a repository based on a filter provided that there was a Git server > > elsewhere that could act as a promisor remote, than that makes sense to > > me. > > > > But I think I'm not quite understanding the rest of what you're > > suggesting. > > Sorry for the lack of clarity here. The goal is to make it easy for a remote to offload a subset > of its objects to __another__ remote (either a Git server or an http server through a remote helper). Does the other server then act as a promisor remote in conjunction with the Git server? I'm having trouble understanding why the _Git_ remote you originally cloned from needs to offload its objects, too. So I think the list would benefit from understanding some more of the details and motivation there. But it would also benefit us to have some understanding of how we'll ensure that any objects which are moved out of a Git repository make their way to another server. I am curious to hear Jonathan Tan's thoughts on this all, too. Thanks, Taylor
This patch series makes partial clone more useful by making it possible to run repack to remove objects from a repository (replacing it with promisor objects). This is useful when we want to offload large blobs from a git server onto another git server, or even use an http server through a remote helper. In [A], a --refilter option on fetch and fetch-pack is being discussed where either a less restrictive or more restrictive filter can be used. In the more restrictive case, the objects that already exist will not be deleted. But, one can imagine that users might want the ability to delete objects when they apply a more restrictive filter in order to save space, and this patch series would also allow that. There are a couple of things we need to adjust to make this possible. This patch has three parts. 1. Allow --filter in pack-objects without --stdout 2. Add a --filter flag for repack 3. Allow missing promisor objects in upload-pack 4. Tests that demonstrate the ability to offload objects onto an http remote cc: Christian Couder christian.couder@gmail.com cc: Derrick Stolee stolee@gmail.com cc: Robert Coup robert@coup.net.nz A. https://lore.kernel.org/git/pull.1138.git.1643730593.gitgitgadget@gmail.com/ John Cai (4): pack-objects: allow --filter without --stdout repack: add --filter=<filter-spec> option upload-pack: allow missing promisor objects tests for repack --filter mode Documentation/git-repack.txt | 5 + builtin/pack-objects.c | 2 - builtin/repack.c | 22 +++-- t/lib-httpd.sh | 2 + t/lib-httpd/apache.conf | 8 ++ t/lib-httpd/list.sh | 43 +++++++++ t/lib-httpd/upload.sh | 46 +++++++++ t/t0410-partial-clone.sh | 81 ++++++++++++++++ t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++ t/t7700-repack.sh | 20 ++++ upload-pack.c | 5 + 11 files changed, 395 insertions(+), 9 deletions(-) create mode 100644 t/lib-httpd/list.sh create mode 100644 t/lib-httpd/upload.sh create mode 100755 t/t0410/git-remote-testhttpgit base-commit: 38062e73e009f27ea192d50481fcb5e7b0e9d6eb Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1206%2Fjohn-cai%2Fjc-repack-filter-v2 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1206/john-cai/jc-repack-filter-v2 Pull-Request: https://github.com/git/git/pull/1206 Range-diff vs v1: 1: 0eec9b117da = 1: f43b76ca650 pack-objects: allow --filter without --stdout -: ----------- > 2: 6e7c8410b8d repack: add --filter=<filter-spec> option -: ----------- > 3: 40612b9663b upload-pack: allow missing promisor objects 2: a3166381572 ! 4: d76faa1f16e repack: add --filter=<filter-spec> option @@ Metadata Author: John Cai <johncai86@gmail.com> ## Commit message ## - repack: add --filter=<filter-spec> option + tests for repack --filter mode - Currently, repack does not work with partial clones. When repack is run - on a partially cloned repository, it grabs all missing objects from - promisor remotes. This also means that when gc is run for repository - maintenance on a partially cloned repository, it will end up getting - missing objects, which is not what we want. - - In order to make repack work with partial clone, teach repack a new - option --filter, which takes a <filter-spec> argument. repack will skip - any objects that are matched by <filter-spec> similar to how the clone - command will skip fetching certain objects. - - The final goal of this feature, is to be able to store objects on a - server other than the regular git server itself. 
     +    This patch adds tests to test both repack --filter functionality in
     +    isolation (in t7700-repack.sh) as well as how it can be used to offload
     +    large blobs (in t0410-partial-clone.sh)

          There are several scripts added so we can test the process of using a
     -    remote helper to upload blobs to an http server:
     +    remote helper to upload blobs to an http server.

          - t/lib-httpd/list.sh lists blobs uploaded to the http server.
          - t/lib-httpd/upload.sh uploads blobs to the http server.

     @@ Commit message
          Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
          Signed-off-by: John Cai <johncai86@gmail.com>

     -    ## Documentation/git-repack.txt ##
     -@@ Documentation/git-repack.txt: depth is 4095.
     - 	a larger and slower repository; see the discussion in
     - 	`pack.packSizeLimit`.
     -
     -+--filter=<filter-spec>::
     -+	Omits certain objects (usually blobs) from the resulting
     -+	packfile. See linkgit:git-rev-list[1] for valid
     -+	`<filter-spec>` forms.
     -+
     - -b::
     - --write-bitmap-index::
     - 	Write a reachability bitmap index as part of the repack. This
     -
     - ## builtin/repack.c ##
     -@@ builtin/repack.c: struct pack_objects_args {
     - 	const char *depth;
     - 	const char *threads;
     - 	const char *max_pack_size;
     -+	const char *filter;
     - 	int no_reuse_delta;
     - 	int no_reuse_object;
     - 	int quiet;
     -@@ builtin/repack.c: static void prepare_pack_objects(struct child_process *cmd,
     - 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
     - 	if (args->max_pack_size)
     - 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
     -+	if (args->filter)
     -+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
     - 	if (args->no_reuse_delta)
     - 		strvec_pushf(&cmd->args, "--no-reuse-delta");
     - 	if (args->no_reuse_object)
     -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
     - 			N_("limits the maximum number of threads")),
     - 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
     - 			N_("maximum size of each packfile")),
     -+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
     -+			N_("object filtering")),
     - 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
     - 			N_("repack objects in packs marked with .keep")),
     - 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
     -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
     - 		if (line.len != the_hash_algo->hexsz)
     - 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
     - 		string_list_append(&names, line.buf);
     -+		if (po_args.filter) {
     -+			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
     -+						  line.buf);
     -+			write_promisor_file(promisor_name, NULL, 0);
     -+		}
     - 	}
     - 	fclose(out);
     - 	ret = finish_command(&cmd);

      ## t/lib-httpd.sh ##
     @@ t/lib-httpd.sh: prepare_httpd() {
      	install_script error-smart-http.sh

     @@ t/t0410-partial-clone.sh: test_expect_success 'fetching of missing objects from
      +	git -C server rev-list --objects --all --missing=print >objects &&
      +	grep "$sha" objects
      +'
     ++
     ++test_expect_success 'fetch does not cause server to fetch missing objects' '
     ++	rm -rf origin server client &&
     ++	test_create_repo origin &&
     ++	dd if=/dev/zero of=origin/file1 bs=801k count=1 &&
     ++	git -C origin add file1 &&
     ++	git -C origin commit -m "large blob" &&
     ++	sha="$(git -C origin rev-parse :file1)" &&
     ++	expected="?$(git -C origin rev-parse :file1)" &&
     ++	git clone --bare --no-local origin server &&
     ++	git -C server remote add httpremote "testhttpgit::${PWD}/server" &&
     ++	git -C server config remote.httpremote.promisor true &&
     ++	git -C server config --remove-section remote.origin &&
     ++	git -C server rev-list --all --objects --filter-print-omitted \
     ++		--filter=blob:limit=800k | perl -ne "print if s/^[~]//" \
     ++		>large_blobs.txt &&
     ++	upload_blobs_from_stdin server <large_blobs.txt &&
     ++	git -C server -c repack.writebitmaps=false repack -a -d \
     ++		--filter=blob:limit=800k &&
     ++	git -C server config uploadpack.allowmissingpromisor true &&
     ++	git clone -c remote.httpremote.url="testhttpgit::${PWD}/server" \
     ++		-c remote.httpremote.fetch='+refs/heads/*:refs/remotes/httpremote/*' \
     ++		-c remote.httpremote.promisor=true --bare --no-local \
     ++		--filter=blob:limit=800k server client &&
     ++	git -C client rev-list --objects --all --missing=print >client_objects &&
     ++	grep "$expected" client_objects &&
     ++	git -C server rev-list --objects --all --missing=print >server_objects &&
     ++	grep "$expected" server_objects
     ++'
      +
       # DO NOT add non-httpd-specific tests here, because the last part of this
       # test script is only executed when httpd is available and enabled.
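Taken together, the new t0410 test above exercises the end-to-end flow the
cover letter describes. Condensed into plain commands, the offloading side
looks roughly like the sketch below; the 800k threshold, the `server` path
and the `upload_blobs_from_stdin` helper are simply the values and helpers
this test happens to use (the helper comes from the series' t/lib-httpd
scripts, it is not part of git itself).

    # 1. List the blobs that exceed the chosen cutoff; --filter-print-omitted
    #    prefixes omitted objects with "~", which the perl filter keeps.
    git -C server rev-list --all --objects --filter-print-omitted \
            --filter=blob:limit=800k |
            perl -ne "print if s/^[~]//" >large_blobs.txt

    # 2. Push those blobs to the offload target (here via the test helper).
    upload_blobs_from_stdin server <large_blobs.txt

    # 3. Repack with the same filter; the filtered-out blobs are dropped
    #    locally and the resulting packs are marked as promisor packs.
    git -C server -c repack.writebitmaps=false repack -a -d \
            --filter=blob:limit=800k

    # 4. Let upload-pack serve clones and fetches despite the missing objects.
    git -C server config uploadpack.allowmissingpromisor true

As the thread above stresses, the upload (steps 1-2) and the removal (step 3)
have to cover the same set of objects; if the filter used for the upload is
narrower than the one given to repack, the repository can end up with objects
that are neither stored locally nor retrievable from the promisor remote.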