Message ID: pull.969.v3.git.1624349082.gitgitgadget@gmail.com
Series: Optimization batch 13: partial clone optimizations for merge-ort
On 6/22/2021 4:04 AM, Elijah Newren via GitGitGadget wrote:
> This series optimizes blob downloading in merges for partial clones. It can
> apply on master. It's independent of ort-perf-batch-12.

As promised, I completed a performance evaluation of this series as well
as ort-perf-batch-12 (and all earlier batches) against our microsoft/git
fork, running in one of our large monorepos that has over 2 million files
at HEAD. Here are my findings.

In my comparisons, I compare the recursive merge strategy with renames
disabled against the ORT strategy (which always has renames enabled).
When I enabled renames for the recursive merge, I saw the partial clone
logic kick in and start downloading many files in every case, so I
dropped that configuration from consideration.

Experiment #1: Large RI/FI merges
---------------------------------

Most of the merge commits in this repo's history are merges of several
long-lived branches as code is merged across organizational boundaries. I
focused on the merge commits in the first-parent history, which means
these are the merges that brought the latest changes from several areas
into the canonical version of the software.

These merges are all automated merges created by libgit2's implementation
of the recursive merge strategy. Since they are created on the server,
they will not have any merge conflicts. They are still interesting
because the sheer number of files that change can be large. This is a
pain point for the recursive strategy because many index entries need to
be updated for the merge. For ORT, some of the updates are simple because
only one side changed a certain subtree (the organizational boundaries
also correspond to the directory structure in many cases).

Across these merges I tested, ORT was _always_ faster than the recursive
strategy. Even more interesting was the fact that the recursive strategy
had very slow outliers while the ORT strategy was much more consistent:

           Recursive      ORT
    -------------------------
    MAX       34.97s    4.74s
    P90       30.04s    4.50s
    P75       15.35s    3.74s
    P50        7.22s    3.39s
    P10        3.61s    3.08s

(I'm not testing ORT with the sparse-index yet. A significant portion of
this 3-second lower bound is due to reading and writing the index file
with 2 million entries. I _am_ using sparse-checkout with only the files
at root, which minimizes the time required to update the working
directory with any changed files.)

For these merges, ORT is a clear win.

Experiment #2: User-created merges
----------------------------------

To find merges that might be created by actual user cases, I ran
'git rev-list --grep="^Merge branch"' to get merges that had default
messages from 'git merge' runs. (The merges from Experiment #1 had other
automated names that did not appear in this search.)

Here, the differences are less striking, but still valuable:

           Recursive      ORT
    -------------------------
    MAX       10.61s    6.27s
    P75        8.81s    3.92s
    P50        4.32s    3.21s
    P10        3.53s    2.95s

The ORT strategy had more variance in these examples, though still not as
much as the recursive strategy. Here the variance is due to conflicting
files needing content merges, which usually were automatically resolved.

This version of the experiment provided interesting observations in a few
cases:

 1. One case had the recursive merge strategy result in a root tree that
    disagreed with what the user committed, but the ORT strategy _did_
    produce the correct resolution. Likely, this is due to the rename
    detection and resolution.
    The user probably had to manually resolve the merge to match their
    expected renames, since we turn off merge.renames in their config.

 2. I watched for the partial clone logic to kick in and download blobs.
    Some of these downloads were inevitable: we need the blobs to resolve
    edit/edit conflicts. In most cases none were downloaded at all, so
    this series is working as advertised. There _was_ a case where the
    inexact rename detection requested a large list of files (~2900 in
    three batches) but _then_ said "inexact rename detection was skipped
    due to too many files". This is a case that would be nice to resolve
    in this series. I will try to find exactly where in the code this is
    being triggered and report back.

 3. As I mentioned, I was using sparse-checkout to limit the size of the
    working directory. In one case of a conflict that could not be
    automatically resolved, the ORT strategy output this error:

        error: could not open '<X>': No such file or directory

    It seems we are looking for a file on disk without considering
    whether it might have the SKIP_WORKTREE bit set in the index. I don't
    think this is an issue for this series, but it might require a
    follow-up on top of the other ORT work.

Conclusions
-----------

I continue to be excited about the ORT strategy and will likely focus on
it in a month or so to integrate it with the sparse-index. I think we
would be interested in making the ORT strategy a new default for Scalar,
but we might really want it to respect merge.renames=false, if only so we
can deploy the settings in stages (first change the strategy, then enable
renames as an independent step) so we can isolate concerns.

Thanks!
-Stolee
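A comparison along these lines might be scripted roughly as follows. This
is a sketch rather than the harness behind the numbers above; the commit
selection and the --no-commit/--abort replay steps are assumptions, and
'-s ort' assumes a Git build that ships the ort strategy:

    # Sketch: replay an existing first-parent merge commit M with both
    # strategies and compare timings.
    M=$(git rev-list --merges --first-parent -1 HEAD)
    git checkout --detach -q "$M^1"
    time git -c merge.renames=false merge -s recursive --no-commit "$M^2"
    git merge --abort
    time git merge -s ort --no-commit "$M^2"
    git merge --abort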
On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/22/2021 4:04 AM, Elijah Newren via GitGitGadget wrote:
> > This series optimizes blob downloading in merges for partial clones. It can
> > apply on master. It's independent of ort-perf-batch-12.
>
> As promised, I completed a performance evaluation of this series as well
> as ort-perf-batch-12 (and all earlier batches) against our microsoft/git
> fork, running in one of our large monorepos that has over 2 million
> files at HEAD. Here are my findings.

Nice, thanks for the investigation and the detailed report!

> In my comparisons, I compare the recursive merge strategy with renames
> disabled against the ORT strategy (which always has renames enabled).
> When I enabled renames for the recursive merge, I saw the partial clone
> logic kick in and start downloading many files in every case, so I
> dropped that configuration from consideration.

I understand it'd take way too long to do exhaustive testing, but just
doing a couple of example comparisons against recursive with rename
detection turned on (and merge.renameLimit set to e.g. 0, or to some high
number if needed) would still be interesting.

> Experiment #1: Large RI/FI merges
> ---------------------------------
> ...
>            Recursive      ORT
>     -------------------------
>     MAX       34.97s    4.74s
>     P90       30.04s    4.50s
>     P75       15.35s    3.74s
>     P50        7.22s    3.39s
>     P10        3.61s    3.08s
>
> (I'm not testing ORT with the sparse-index yet. A significant portion of
> this 3-second lower bound is due to reading and writing the index file
> with 2 million entries. I _am_ using sparse-checkout with only the files
> at root, which minimizes the time required to update the working
> directory with any changed files.)
>
> For these merges, ORT is a clear win.

Wahoo! I suspect I know what you're seeing from recursive as well;
recursive had some quadratic behaviors within it, separate from the calls
into diffcore-rename. They were much less likely to be triggered with
rename detection turned off, but could still be triggered. I suspect you
just had enough testcases that some of them were affected. (I
particularly suspect the item described at (5a) of
https://lore.kernel.org/git/pull.842.git.1612331345.gitgitgadget@gmail.com/)

Given the 2 million entries, I'm curious whether ort-perf-batch-14 (not
yet submitted; I've been waiting a few weeks for ort-perf-batch-12 to hit
next) will provide further significant benefits. If it really does take
3s or so to do index reading and writing, then it'd only be able to help
with the slower cases, but I'm still curious. It helped rather
dramatically on my linux kernel testcase with lots of renames.

> Experiment #2: User-created merges
> ----------------------------------
>
> To find merges that might be created by actual user cases, I ran
> 'git rev-list --grep="^Merge branch"' to get merges that had default
> messages from 'git merge' runs. (The merges from Experiment #1 had other
> automated names that did not appear in this search.)
>
> Here, the differences are less striking, but still valuable:
>
>            Recursive      ORT
>     -------------------------
>     MAX       10.61s    6.27s
>     P75        8.81s    3.92s
>     P50        4.32s    3.21s
>     P10        3.53s    2.95s
>
> The ORT strategy had more variance in these examples, though still not
> as much as the recursive strategy. Here the variance is due to
> conflicting files needing content merges, which usually were
> automatically resolved.

Is the max case on the ort side related to item #2 below, where perhaps
there was no automatic resolution of some file?
> This version of the experiment provided interesting observations in a
> few cases:
>
>  1. One case had the recursive merge strategy result in a root tree that
>     disagreed with what the user committed, but the ORT strategy _did_
>     produce the correct resolution. Likely, this is due to the rename
>     detection and resolution. The user probably had to manually resolve
>     the merge to match their expected renames, since we turn off
>     merge.renames in their config.

I agree that seems like the likely explanation.

>  2. I watched for the partial clone logic to kick in and download blobs.
>     Some of these downloads were inevitable: we need the blobs to
>     resolve edit/edit conflicts. In most cases none were downloaded at
>     all, so this series is working as advertised. There _was_ a case
>     where the inexact rename detection requested a large list of files
>     (~2900 in three batches) but _then_ said "inexact rename detection
>     was skipped due to too many files". This is a case that would be
>     nice to resolve in this series. I will try to find exactly where in
>     the code this is being triggered and report back.

This suggests perhaps that EITHER there was a real modify/delete conflict
(because you have to do full rename detection to rule out that the
modify/delete was part of some rename), OR there was a renamed file
modified on both sides that did not keep its original basename (because
that combination is needed to bypass the various optimizations and make
it fall back to full inexact rename detection). Further, in either case,
there were enough adds/deletes that full inexact detection is still a bit
expensive. It'd be interesting to know which case it was. What happens if
you set merge.renameLimit to something higher (the default is
surprisingly small)?

>  3. As I mentioned, I was using sparse-checkout to limit the size of the
>     working directory. In one case of a conflict that could not be
>     automatically resolved, the ORT strategy output this error:
>
>         error: could not open '<X>': No such file or directory
>
>     It seems we are looking for a file on disk without considering
>     whether it might have the SKIP_WORKTREE bit set in the index. I
>     don't think this is an issue for this series, but it might require a
>     follow-up on top of the other ORT work.

I suspect either checkout() or record_conflicted_index_entries() is to
blame, and yeah, it'd be an issue from a previous series. If you have any
further details or even hints about reproducing it (doesn't have to be a
complete set of steps or a minimal testcase), I would love to hear them.
(Incidentally, of all the reviews on merge-ort, those two functions were
left out; look for "17" and "19" in
https://lore.kernel.org/git/pull.923.git.git.1606635803.gitgitgadget@gmail.com/)

> Conclusions
> -----------
>
> I continue to be excited about the ORT strategy and will likely focus on
> it in a month or so to integrate it with the sparse-index. I think we
> would be interested in making the ORT strategy a new default for Scalar,
> but we might really want it to respect merge.renames=false, if only so
> we can deploy the settings in stages (first change the strategy, then
> enable renames as an independent step) so we can isolate concerns.

Given that my ort performance work concentrated on rename detection, it's
encouraging to see that ort-WITH-renames is generally faster for you than
recursive-WITHOUT-renames. That means not only that rename detection can
be done in reasonable time now, but that the improvements outside of
rename detection show some real benefits as well.
It'll be interesting to see how the final two performance series for
merge-ort, which will concentrate on areas other than rename detection,
will affect the results above.

And by the way, thanks for the herculean review effort you've put in.
You've reviewed nearly every merge-ort and diffcore-rename patch (which
included many big, tricky patches)...and a quick search says there have
been at least 128 patches so far in the last 8 months. The code is better
(cleaner, clearer, and faster) for all your reviews.
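The extra comparison suggested above (recursive with renames enabled and
a raised merge.renameLimit) might look like the following sketch; the
limit value is illustrative, not taken from the thread:

    # Sketch: the same merge replayed with recursive *with* renames, and
    # merge.renameLimit raised so rename detection is not skipped.
    git checkout --detach -q "$M^1"      # $M as in the earlier sketch
    time git -c merge.renames=true -c merge.renameLimit=100000 \
        merge -s recursive --no-commit "$M^2"
    git merge --abort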
On 6/22/2021 2:45 PM, Elijah Newren wrote:
> On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@gmail.com> wrote:

I want to focus on this item:

>> 2. I watched for the partial clone logic to kick in and download blobs.
>>    Some of these downloads were inevitable: we need the blobs to
>>    resolve edit/edit conflicts. In most cases none were downloaded at
>>    all, so this series is working as advertised. There _was_ a case
>>    where the inexact rename detection requested a large list of files
>>    (~2900 in three batches) but _then_ said "inexact rename detection
>>    was skipped due to too many files". This is a case that would be
>>    nice to resolve in this series. I will try to find exactly where in
>>    the code this is being triggered and report back.
>
> This suggests perhaps that EITHER there was a real modify/delete
> conflict (because you have to do full rename detection to rule out that
> the modify/delete was part of some rename), OR there was a renamed file
> modified on both sides that did not keep its original basename (because
> that combination is needed to bypass the various optimizations and make
> it fall back to full inexact rename detection). Further, in either case,
> there were enough adds/deletes that full inexact detection is still a
> bit expensive. It'd be interesting to know which case it was. What
> happens if you set merge.renameLimit to something higher (the default is
> surprisingly small)?

The behavior I'd like to see is that the partial clone logic is not run
if we are going to download more than merge.renameLimit files. Whatever
is fetching these missing blobs happens earlier than the limit check, but
it should happen after it instead.

It's particularly problematic that Git does all the work to get the
blobs, but then gives up and doesn't even use them for rename detection.

I'm happy that we download the necessary blobs when there are a few dozen
files that need inexact renames. When it gets into the thousands, we jump
into a different category of user experience.

Having a stop-gap of rename detection limits is an important way to avoid
huge amounts of file downloads in these huge repo cases. Users can always
opt into a larger limit if they really do want rename detection to work
at such a large scale, but we still need protections for the vast
majority of cases where a user isn't willing to pay the cost of
downloading these blobs.

Thanks,
-Stolee
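Watching the partial clone machinery, and opting into a larger limit,
might look like the following sketch; the limit value and the branch name
('topic') are purely illustrative:

    # Crude check: count blobs reachable from HEAD that are still missing
    # locally; comparing the count before and after a merge shows roughly
    # how many were downloaded.
    git rev-list --objects --missing=print HEAD | grep -c '^?'

    # Opting into a larger rename limit for a single merge:
    git -c merge.renameLimit=50000 merge -s ort topic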
On Tue, Jun 22, 2021 at 7:14 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/22/2021 2:45 PM, Elijah Newren wrote:
> > On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> I want to focus on this item:
>
> >> 2. I watched for the partial clone logic to kick in and download
> >>    blobs. Some of these downloads were inevitable: we need the blobs
> >>    to resolve edit/edit conflicts. In most cases none were downloaded
> >>    at all, so this series is working as advertised. There _was_ a
> >>    case where the inexact rename detection requested a large list of
> >>    files (~2900 in three batches) but _then_ said "inexact rename
> >>    detection was skipped due to too many files". This is a case that
> >>    would be nice to resolve in this series. I will try to find
> >>    exactly where in the code this is being triggered and report back.
> >
> > This suggests perhaps that EITHER there was a real modify/delete
> > conflict (because you have to do full rename detection to rule out
> > that the modify/delete was part of some rename), OR there was a
> > renamed file modified on both sides that did not keep its original
> > basename (because that combination is needed to bypass the various
> > optimizations and make it fall back to full inexact rename detection).
> > Further, in either case, there were enough adds/deletes that full
> > inexact detection is still a bit expensive. It'd be interesting to
> > know which case it was. What happens if you set merge.renameLimit to
> > something higher (the default is surprisingly small)?
>
> The behavior I'd like to see is that the partial clone logic is not run
> if we are going to download more than merge.renameLimit files. Whatever
> is fetching these missing blobs happens earlier than the limit check,
> but it should happen after it instead.
>
> It's particularly problematic that Git does all the work to get the
> blobs, but then gives up and doesn't even use them for rename detection.

I agree with what should happen, but I'm surprised it's not already
happening. The give-up check comes from too_many_rename_candidates(),
which is called before dpf_options.missing_object_cb is even set to
inexact_prefetch. So I'm not sure how the fetching comes first. Is there
a microsoft-specific patch that changes the order somehow? Is there
something I'm mis-reading in the code?

> I'm happy that we download the necessary blobs when there are a few
> dozen files that need inexact renames. When it gets into the thousands,
> we jump into a different category of user experience.
>
> Having a stop-gap of rename detection limits is an important way to
> avoid huge amounts of file downloads in these huge repo cases. Users can
> always opt into a larger limit if they really do want rename detection
> to work at such a large scale, but we still need protections for the
> vast majority of cases where a user isn't willing to pay the cost of
> downloading these blobs.

Sure, that's fair. But I'm still curious what the particular shape of the
data in question is. What does the error say merge.renameLimit would need
to be set to? If it's set higher, do some of the files resolve nicely
(i.e. they really were renames modified on both sides with a different
basename), or are they modify/delete conflicts, where we're paying the
cost of rename detection to verify they were deleted and we really do
have a conflict? I'm curious if there's more to learn and more
optimization potential, basically. Your repository is bigger, so there
may be more to learn from it than from the testcases I've tried so
far. :-)
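Answering those questions could look roughly like the following sketch;
the limit and branch name are illustrative, and in practice the value
would come from the limit that Git's warning suggests:

    # Replay the problematic merge with a raised limit, then list the
    # paths that still conflict; renames that resolved nicely drop out.
    git -c merge.renameLimit=50000 merge -s ort --no-commit topic
    git diff --name-only --diff-filter=U
    git merge --abort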
On Wed, Jun 23, 2021 at 1:11 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Jun 22, 2021 at 7:14 PM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 6/22/2021 2:45 PM, Elijah Newren wrote:
> > > On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > I want to focus on this item:
> >
> > >> 2. I watched for the partial clone logic to kick in and download
> > >>    blobs. Some of these downloads were inevitable: we need the
> > >>    blobs to resolve edit/edit conflicts. In most cases none were
> > >>    downloaded at all, so this series is working as advertised.
> > >>    There _was_ a case where the inexact rename detection requested
> > >>    a large list of files (~2900 in three batches) but _then_ said
> > >>    "inexact rename detection was skipped due to too many files".
> > >>    This is a case that would be nice to resolve in this series. I
> > >>    will try to find exactly where in the code this is being
> > >>    triggered and report back.
> > >
> > > This suggests perhaps that EITHER there was a real modify/delete
> > > conflict (because you have to do full rename detection to rule out
> > > that the modify/delete was part of some rename), OR there was a
> > > renamed file modified on both sides that did not keep its original
> > > basename (because that combination is needed to bypass the various
> > > optimizations and make it fall back to full inexact rename
> > > detection). Further, in either case, there were enough adds/deletes
> > > that full inexact detection is still a bit expensive. It'd be
> > > interesting to know which case it was. What happens if you set
> > > merge.renameLimit to something higher (the default is surprisingly
> > > small)?
> >
> > The behavior I'd like to see is that the partial clone logic is not
> > run if we are going to download more than merge.renameLimit files.
> > Whatever is fetching these missing blobs happens earlier than the
> > limit check, but it should happen after it instead.
> >
> > It's particularly problematic that Git does all the work to get the
> > blobs, but then gives up and doesn't even use them for rename
> > detection.
>
> I agree with what should happen, but I'm surprised it's not already
> happening. The give-up check comes from too_many_rename_candidates(),
> which is called before dpf_options.missing_object_cb is even set to
> inexact_prefetch. So I'm not sure how the fetching comes first. Is
> there a microsoft-specific patch that changes the order somehow? Is
> there something I'm mis-reading in the code?

After thinking about it more, I wonder if the following is what is
happening:

 * Some directory was renamed (and likely not a leaf), resulting in a
   large pile of renames (or some other change that gives lots of
   renames without changing basenames).

 * basename_prefetch() kicks in to download the files needed for
   basename matching (even for "irrelevant" renames).

 * Basename matching is performed (which is linear in the number of
   renames, and not subject to merge.renameLimit).

 * After basename matching, there are still unmatched destinations and
   relevant sources, which would need to be handled by the quadratic
   inexact matching algorithm.

 * Since there are enough renames left to trigger the renameLimit, you
   get the "inexact rename detection was skipped due to too many files"
   warning.

...and then you assumed the prefetching was for the quadratic inexact
rename detection when in reality it was just for the linear basename
matching.
If so, there are a couple of things we could do here:

 * The easy change: modify the warning, perhaps to something like
   "quadratic inexact rename detection was skipped...", to make it better
   reflect its meaning (basename matching is also an inexact match).

 * The more complicated change: be more aggressive about not detecting
   renames via basename match for "irrelevant" sources...perhaps avoiding
   it when it involves prefetching, or maybe just when we can somehow
   determine that we'd bail on the quadratic inexact rename detection
   anyway.

The second point perhaps deserves a bit more explanation. Basename
matching has the advantage of removing both sources and destinations from
the later quadratic (O(N*M)) inexact rename detection, so it was
generally beneficial to just always do the basename matching even if the
path wasn't relevant for either content or location reasons. But that
presumed the later quadratic inexact rename detection was going to run
rather than be skipped; it was one of those tradeoffs in optimization. If
the file hasn't been prefetched and we're likely to skip the quadratic
inexact rename detection, then trying to do basename matching for an
irrelevant rename is wasted work, as is prefetching the blobs in order to
be able to detect that irrelevant rename.

But how do we know whether we'll skip the quadratic rename detection if
we don't first match the basenames that we can, since every matched
basename removes both a source and a destination from the unmatched
pairs? Maybe we should reorder the basename matching as a two-pass
algorithm, where we first detect basename-matching renames for all
relevant sources, and then repeat for non-relevant sources. That would
also allow us to insert a check before the second pass: if the first pass
removed all relevant sources (meaning both that no relevant sources were
left out of basename matching and that we found a match for every
relevant source included in it), then we can exit without doing the
second pass. That might even improve performance in cases without
prefetching, though it would mean adding another prefetch step.

After doing the above splitting, we could then add extra conditions
before the second pass, such as bailing on it if we think we'd bail on
quadratic inexact rename detection (which may still be hard to guess). Or
maybe even just bailing on the second pass if prefetching would be
involved, because we'd rather lump those blobs all into the quadratic
inexact rename detection anyway.

> But I'm still curious what the particular shape of the data in question
> is. What does the error say merge.renameLimit would need to be set to?
> If it's set higher, do some of the files resolve nicely (i.e. they
> really were renames modified on both sides with a different basename),
> or are they modify/delete conflicts, where we're paying the cost of
> rename detection to verify they were deleted and we really do have a
> conflict? I'm curious if there's more to learn and more optimization
> potential, basically. Your repository is bigger, so there may be more
> to learn from it than from the testcases I've tried so far. :-)

We might have just identified multiple additional optimization
opportunities above, neither of which is included in my pending
optimization series.
It would be helpful to get more details about how frequently these kinds
of cases occur, the particular renameLimit in use, the number of paths
involved (the number of unmatched sources and destinations), how many of
the sources are relevant (perhaps even broken down into content-relevant
and location-relevant), etc., as these all may help inform the
implementation and whatever tests we want to add to the testsuite.
However, some of that info is currently hard to gather. I could probably
start by adding some trace2 statistics to diffcore-rename to print out
the original number of sources and destinations, the number matched via
exact matching, the number matched via basename matching, the number of
sources removed due to being irrelevant, and anything else I might be
overlooking at the moment, to help gather relevant data.
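Until such statistics exist, one can at least capture what trace2 already
emits during one of these merges; a sketch, with the output path and
branch name illustrative:

    # Record the trace2 event stream for a single merge, for later
    # analysis of regions and timings:
    GIT_TRACE2_EVENT=/tmp/ort-merge-trace.json \
        git merge -s ort topic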