Message ID: pull.1785.git.1725890210.gitgitgadget@gmail.com (mailing list archive)
Series: pack-objects: create new name-hash algorithm
On 9/9/24 9:56 AM, Derrick Stolee via GitGitGadget wrote:
> However, my findings show that when a repository has many versions of files
> at the same path (and especially when there are many name-hash collisions)
> then there are significant gains to be made using the new algorithm.

Of course this table didn't render correctly. Here's a readable version:

| Repo     | Standard Repack | With --full-name-hash |
|----------|-----------------|-----------------------|
| fluentui | 438 MB          | 168 MB                |
| Repo B   | 6,255 MB        | 829 MB                |
| Repo C   | 37,737 MB       | 7,125 MB              |
| Repo D   | 130,049 MB      | 6,190 MB              |

Thanks,
-Stolee
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> One way to find some improvement in these repositories is to increase the
> window size, which was an initial indicator that the delta compression could
> be improved, but was not a clear indicator. After some digging (and
> prototyping some analysis tools) the main discovery was that the current
> name-hash algorithm only considers the last 16 characters in the path name
> and has some naturally-occurring collisions within that scope.

Yes, as I explained in the other message, this "collision" is an
integral part of the design to allow us to gather candidates together
that may yield good deltas among them. In addition, header files
whose names end with ".h" tend to share a big comment at the
beginning of them in many projects, and the proximity (not
"collision") of the hash value is used to make them delta candidates
with each other.

I do agree that considering files at the same path from different
(but close-by) revisions as the prime candidates is very important,
but if you spread the "collisions" very thin by using "uniform
distribution", I am afraid that you'd end up discarding anything but
the blobs at the same path, which may go too far. Having name-hash
values that are close by no longer has any meaning in such a system.

I hope you can find a solution that strikes a good balance at the
end of the series (I saw only the first step), but I suspect an easy
way to avoid the downsides you observed is to use both. Comparing
with a handful of blobs taken from nearby commits (the original
object order is roughly in traversal order, and you can take
advantage of that fact) from exactly the same path (using your
"uniform distribution") before comparing with the blobs with close
values (of the current function), like the current implementation
does, may go a long way.
On 9/9/24 2:06 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> One way to find some improvement in these repositories is to increase the
>> window size, which was an initial indicator that the delta compression could
>> be improved, but was not a clear indicator. After some digging (and
>> prototyping some analysis tools) the main discovery was that the current
>> name-hash algorithm only considers the last 16 characters in the path name
>> and has some naturally-occurring collisions within that scope.
>
> Yes, as I explained in the other message, this "collision" is an
> integral part of the design to allow us to gather candidates together
> that may yield good deltas among them. In addition, header files
> whose names end with ".h" tend to share a big comment at the
> beginning of them in many projects, and the proximity (not
> "collision") of the hash value is used to make them delta candidates
> with each other.
>
> I do agree that considering files at the same path from different
> (but close-by) revisions as the prime candidates is very important,
> but if you spread the "collisions" very thin by using "uniform
> distribution", I am afraid that you'd end up discarding anything but
> the blobs at the same path, which may go too far. Having name-hash
> values that are close by no longer has any meaning in such a system.

You are right that some "nearby" paths are lost in this change, and
this can be measured by trying to use this option in the pack-objects
process underneath a small 'git push'.

The thing that surprised me is just how effective this is for the
creation of large pack-files that include many versions of most
files. The cross-path deltas have less of an effect here, and the
benefits of avoiding name-hash collisions can be overwhelming in
many cases.
> I hope you can find a solution that strikes a good balance at the
> end of the series (I saw only the first step), but I suspect an easy
> way to avoid the downsides you observed is to use both. Comparing
> with a handful of blobs taken from nearby commits (the original
> object order is roughly in traversal order, and you can take
> advantage of that fact) from exactly the same path (using your
> "uniform distribution") before comparing with the blobs with close
> values (of the current function), like the current implementation
> does, may go a long way.

Funny you should say that, since the RFC I finally submitted [1]
actually does just that. The --path-walk option changes the object
walk to consider batches of objects based on their path, computes
deltas among that batch, and then does the normal name-hash pass
later. This seems to really strike the balance that you are
looking for and solves the issues where small pushes need to stay
small. It also fixes some problematic cases even when pushing a
single commit.

[1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

However, the --path-walk option requires a significant implementation
of a "path walk API" and my RFC doesn't even do threading right.
The --path-walk version (probably) doesn't work with delta islands
or other features the same way as the drop-in change to the
name-hash heuristic can. For that reason, I think there is likely
some long-term value to the --full-name-hash option even though the
--path-walk option will be better in many cases.

Thanks,
-Stolee
On Mon, Sep 09, 2024 at 10:37:30PM -0400, Derrick Stolee wrote:
> > I do agree that considering files at the same path from different
> > (but close-by) revisions as the prime candidates is very important,
> > but if you spread the "collisions" very thin by using "uniform
> > distribution", I am afraid that you'd end up discarding anything but
> > the blobs at the same path, which may go too far. Having name-hash
> > values that are close by no longer has any meaning in such a system.
>
> You are right that some "nearby" paths are lost in this change, and
> this can be measured by trying to use this option in the pack-objects
> process underneath a small 'git push'.
>
> The thing that surprised me is just how effective this is for the
> creation of large pack-files that include many versions of most
> files. The cross-path deltas have less of an effect here, and the
> benefits of avoiding name-hash collisions can be overwhelming in
> many cases.

I think that Junio's suggestion is pretty interesting (though please
take my comments here with a grain of salt, since I haven't read the
other series yet, and am not sure how much of this is redundant).

Imagine computing both the full and existing name-hash values for each
blob/tree in the pack. Then objects would be sorted in the delta
selection window by similar full-name hash and similar regular
name-hash values.

I'm not sure which value you'd actually record in the pack, though.
Ideally there is a hash function which captures some information about
the full path as well as the final path component, so we could use a
single value here, though I suspect the implementation would be more
complicated than what is presented here.

> > I hope you can find a solution that strikes a good balance at the
> > end of the series (I saw only the first step), but I suspect an easy
> > way to avoid the downsides you observed is to use both. Comparing
> > with a handful of blobs taken from nearby commits (the original
> > object order is roughly in traversal order, and you can take
> > advantage of that fact) from exactly the same path (using your
> > "uniform distribution") before comparing with the blobs with close
> > values (of the current function), like the current implementation
> > does, may go a long way.
>
> Funny you should say that, since the RFC I finally submitted [1]
> actually does just that. The --path-walk option changes the object
> walk to consider batches of objects based on their path, computes
> deltas among that batch, and then does the normal name-hash pass
> later. This seems to really strike the balance that you are
> looking for and solves the issues where small pushes need to stay
> small. It also fixes some problematic cases even when pushing a
> single commit.

Interesting.

> [1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/

> However, the --path-walk option requires a significant implementation
> of a "path walk API" and my RFC doesn't even do threading right.
> The --path-walk version (probably) doesn't work with delta islands
> or other features the same way as the drop-in change to the
> name-hash heuristic can. For that reason, I think there is likely
> some long-term value to the --full-name-hash option even though the
> --path-walk option will be better in many cases.

I suspect that this is going to be a significant sticking point. Not
supporting multi-threading is work-able for GitHub (since we set
pack.threads=1 today), but lacking support for delta islands makes
this a non-starter to run at GitHub.

Do you imagine that the --path-walk option could be made to work with
delta islands? I'm not all that worried about who does that work, but
more interested at this point in whether or not it's even possible to
implement.

Thanks,
Taylor
Derrick Stolee <stolee@gmail.com> writes:

> The thing that surprised me is just how effective this is for the
> creation of large pack-files that include many versions of most
> files. The cross-path deltas have less of an effect here, and the
> benefits of avoiding name-hash collisions can be overwhelming in
> many cases.

Yes, "make sure we notice a file F moving from directory A to B" is
inherently optimized for a short span of history, i.e. a smallish push
rather than a whole-history clone, where the definition of "smallish"
is that even if you create optimal delta chains, the length of these
delta chains will not exceed the "--depth" option.

If the history you are pushing modified A/F twice, renamed it to B/F
(with or without modification at the same time), then modified B/F
twice more, you'd want to pack the 5-commit segment, and having to
artificially cut the delta chain that could contain all of these 5
blobs into two at the renaming commit is a huge loss.

Compared to that, a whole-history clone is a very different story, as
we will have to chomp delta chains at some depth anyway. Before the
rename, it is reasonable to assume that A/F has evolved incrementally
for a number of revisions, and after the rename it is expected that
B/F will evolve incrementally for a number of revisions before it gets
renamed again. It is just a matter of choosing where in that long
stretch of content evolution we cut the delta chain, and the commit
that renamed the path may just be a good, if not absolutely optimal,
point. So I do agree that this is an important case to optimize for.

At some point, even when taking only the blobs at the same path as
delta-base candidates, your true best base may be outside of the
--window in the list of candidates (sorted by size in decreasing
order), but at that point you would be increasing the window to find a
better delta base, not to skip unrelated blobs that happened to have
been thrown into the same hash bucket by a design that optimizes for a
different case. So we can say that it is worth spending the extra
cycles and memory if you need a larger window to gain an even better
packing result.

> Funny you should say that, since the RFC I finally submitted [1]
> actually does just that. The --path-walk option changes the object
> walk to consider batches of objects based on their path, computes
> deltas among that batch, and then does the normal name-hash pass
> later. This seems to really strike the balance that you are
> looking for and solves the issues where small pushes need to stay
> small. It also fixes some problematic cases even when pushing a
> single commit.

;-).
Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <stolee@gmail.com> writes:
>
>> The thing that surprised me is just how effective this is for the
>> creation of large pack-files that include many versions of most
>> files. The cross-path deltas have less of an effect here, and the
>> benefits of avoiding name-hash collisions can be overwhelming in
>> many cases.
>
> Yes, "make sure we notice a file F moving from directory A to B" is
> inherently optimized for a short span of history, i.e. a smallish push
> rather than a whole-history clone, where the definition of "smallish"
> is that even if you create optimal delta chains, the length of these
> delta chains will not exceed the "--depth" option.
>
> If the history you are pushing modified A/F twice, renamed it to B/F
> (with or without modification at the same time), then modified B/F
> twice more, you'd want to pack the 5-commit segment, and having to
> artificially cut the delta chain that could contain all of these 5
> blobs into two at the renaming commit is a huge loss.

Which actually leads me to suspect that we probably do not even have
to expose the --full-name-hash option to the end users in "git repack".

If we are doing an incremental repack that would fit within the depth
setting, it is likely that we would be better off without the
full-name-hash optimization, and if we are doing "repack -a" for the
whole repository, especially with "-f", it would make sense to do the
full-name-hash optimization.

If we can tell how large a chunk of history we are packing before we
actually start calling builtin/pack-objects.c:add_object_entry(), we
probably should be able to even select between with and without
full-name-hash automatically. But I do not think we know the object
count before we finish calling add_object_entry(), so unless we are
willing to compute and keep both while reading and pick between the
two after we finish reading the list of objects, or something, it will
require major surgery to do so, I am afraid.
On 9/10/24 10:56 AM, Taylor Blau wrote:
> On Mon, Sep 09, 2024 at 10:37:30PM -0400, Derrick Stolee wrote:
>>> I do agree that considering files at the same path from different
>>> (but close-by) revisions as the prime candidates is very important,
>>> but if you spread the "collisions" very thin by using "uniform
>>> distribution", I am afraid that you'd end up discarding anything but
>>> the blobs at the same path, which may go too far. Having name-hash
>>> values that are close by no longer has any meaning in such a system.
>>
>> You are right that some "nearby" paths are lost in this change, and
>> this can be measured by trying to use this option in the pack-objects
>> process underneath a small 'git push'.
>>
>> The thing that surprised me is just how effective this is for the
>> creation of large pack-files that include many versions of most
>> files. The cross-path deltas have less of an effect here, and the
>> benefits of avoiding name-hash collisions can be overwhelming in
>> many cases.
>
> I think that Junio's suggestion is pretty interesting (though please
> take my comments here with a grain of salt, since I haven't read the
> other series yet, and am not sure how much of this is redundant).
>
> Imagine computing both the full and existing name-hash values for each
> blob/tree in the pack. Then objects would be sorted in the delta
> selection window by similar full-name hash and similar regular
> name-hash values.
>
> I'm not sure which value you'd actually record in the pack, though.
> Ideally there is a hash function which captures some information about
> the full path as well as the final path component, so we could use a
> single value here, though I suspect the implementation would be more
> complicated than what is presented here.

Is the name hash stored in the pack itself? I know that it is stored
in the 'struct object_entry' data in the packing data.

While we could add another uint32_t into that struct to store both
hash values, this would increase the memory requirements of repacking
by four bytes per object. The struct seems to be very deliberate about
trying as hard as possible to avoid doing that.

But maybe an alternative could be replacing that 32-bit number with an
index into an array of paths that have their hash values stored there.
This would still involve two passes, but might still be possible. I'll
think on this.

>>> I hope you can find a solution that strikes a good balance at the
>>> end of the series (I saw only the first step), but I suspect an easy
>>> way to avoid the downsides you observed is to use both. Comparing
>>> with a handful of blobs taken from nearby commits (the original
>>> object order is roughly in traversal order, and you can take
>>> advantage of that fact) from exactly the same path (using your
>>> "uniform distribution") before comparing with the blobs with close
>>> values (of the current function), like the current implementation
>>> does, may go a long way.
>>
>> Funny you should say that, since the RFC I finally submitted [1]
>> actually does just that. The --path-walk option changes the object
>> walk to consider batches of objects based on their path, computes
>> deltas among that batch, and then does the normal name-hash pass
>> later. This seems to really strike the balance that you are
>> looking for and solves the issues where small pushes need to stay
>> small. It also fixes some problematic cases even when pushing a
>> single commit.
>
> Interesting.
>
>> [1] https://lore.kernel.org/git/pull.1786.git.1725935335.gitgitgadget@gmail.com/
>
>> However, the --path-walk option requires a significant implementation
>> of a "path walk API" and my RFC doesn't even do threading right.
>> The --path-walk version (probably) doesn't work with delta islands
>> or other features the same way as the drop-in change to the
>> name-hash heuristic can. For that reason, I think there is likely
>> some long-term value to the --full-name-hash option even though the
>> --path-walk option will be better in many cases.
>
> I suspect that this is going to be a significant sticking point. Not
> supporting multi-threading is work-able for GitHub (since we set
> pack.threads=1 today), but lacking support for delta islands makes
> this a non-starter to run at GitHub.
>
> Do you imagine that the --path-walk option could be made to work with
> delta islands? I'm not all that worried about who does that work, but
> more interested at this point in whether or not it's even possible to
> implement.

This is part of the reason why I think the --full-name-hash option is
an interesting consideration. It doesn't have any obvious reason why
it couldn't work with features like delta islands, so it may provide
some quick wins in "large enough" repositories, or at least ones that
are "large in the right way".

I unfortunately don't know enough about how the delta islands feature
works to be confident in the possibility of integrating it with the
--path-walk option. At minimum, it would require two object walks: the
first would mark the objects and the second would do the delta
compression with those markings in mind.

But if there is a way to combine both approaches with a two-pass delta
compression technique, then this may all be avoided. I'll give it a
try.

Thanks,
-Stolee
On 9/10/24 4:36 PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> The thing that surprised me is just how effective this is for the
>>> creation of large pack-files that include many versions of most
>>> files. The cross-path deltas have less of an effect here, and the
>>> benefits of avoiding name-hash collisions can be overwhelming in
>>> many cases.
>>
>> Yes, "make sure we notice a file F moving from directory A to B" is
>> inherently optimized for a short span of history, i.e. a smallish push
>> rather than a whole-history clone, where the definition of "smallish"
>> is that even if you create optimal delta chains, the length of these
>> delta chains will not exceed the "--depth" option.
>>
>> If the history you are pushing modified A/F twice, renamed it to B/F
>> (with or without modification at the same time), then modified B/F
>> twice more, you'd want to pack the 5-commit segment, and having to
>> artificially cut the delta chain that could contain all of these 5
>> blobs into two at the renaming commit is a huge loss.
>
> Which actually leads me to suspect that we probably do not even have
> to expose the --full-name-hash option to the end users in "git repack".
>
> If we are doing an incremental repack that would fit within the depth
> setting, it is likely that we would be better off without the
> full-name-hash optimization, and if we are doing "repack -a" for the
> whole repository, especially with "-f", it would make sense to do the
> full-name-hash optimization.

Depending on how much we learn from others testing the
--full-name-hash option, I could see the potential that -a could imply
--full-name-hash. I hesitate to introduce that in the first release
with this option, though.

> If we can tell how large a chunk of history we are packing before we
> actually start calling builtin/pack-objects.c:add_object_entry(), we
> probably should be able to even select between with and without
> full-name-hash automatically. But I do not think we know the object
> count before we finish calling add_object_entry(), so unless we are
> willing to compute and keep both while reading and pick between the
> two after we finish reading the list of objects, or something, it
> will require major surgery to do so, I am afraid.

It's also possible that we could check the list of paths at HEAD to
see how many collisions the default name-hash gives. In cases like the
Git repository, there are very few collisions, and thus we don't need
to use --full-name-hash. Restricting the analysis to just HEAD (or the
default ref) is not complete, but it might be a good heuristic.

Thanks,
-Stolee
On Tue, Sep 10, 2024 at 05:05:09PM -0400, Derrick Stolee wrote:
> > I'm not sure which value you'd actually record in the pack, though.
> > Ideally there is a hash function which captures some information about
> > the full path as well as the final path component, so we could use a
> > single value here, though I suspect the implementation would be more
> > complicated than what is presented here.
>
> Is the name hash stored in the pack itself? I know that it is stored
> in the 'struct object_entry' data in the packing data. While we could
> add another uint32_t into that struct to store both hash values, this
> would increase the memory requirements of repacking by four bytes per
> object. The struct seems to be very deliberate about trying as hard
> as possible to avoid doing that.

It's stored in the .bitmap files, since otherwise a pack-objects which
uses bitmaps to serve a fetch would have no clue of their path names.
See the "HASH_CACHE" bitmap extension.

You generally don't want to make deltas out of two entries in the
bitmap (they're already in the same pack, so we'd usually skip them),
but you do want to consider making on-the-fly deltas against other
objects. I guess we may also consider deltas between objects in two
packs that are both covered by the same midx bitmap.

> But maybe an alternative could be replacing that 32-bit number with
> an index into an array of paths that have their hash values stored
> there.

Yes, that would work, though how big is that path array going to be?
The path list of uncompressed linux.git is probably 3-4MB, which
actually doesn't sound _too_ bad. You could obviously go a long way
with prefix compression, too.

But if I understand the proposal, it is just replacing one 32-bit hash
with another. You could just store that in the bitmap instead (or, if
the direction is to use both, introduce a new extension to store
both). Obviously you'll get lousy results if the bitmap reader does
not use the same algorithm for its non-bitmap objects, but I don't
think this is something you'd be flipping back and forth on.

> This is part of the reason why I think the --full-name-hash option is
> an interesting consideration. It doesn't have any obvious reason why
> it couldn't work with features like delta islands, so it may provide
> some quick wins in "large enough" repositories, or at least ones that
> are "large in the right way".
>
> I unfortunately don't know enough about how the delta islands feature
> works to be confident in the possibility of integrating it with the
> --path-walk option. At minimum, it would require two object walks:
> the first would mark the objects and the second would do the delta
> compression with those markings in mind.

The delta islands code already does its own tree walk to propagate the
bits down (it does rely on the base walk's show_commit() to propagate
through the commits). Once each object has its island bitmap, I think
however you choose to come up with delta candidates (whether the
current type/size/namehash sorted list, or some path walking), you
should be able to use it. It's fundamentally just answering the
question of "am I allowed to delta between these two objects".

Of course the devil may be in the details. ;)

-Peff