From patchwork Sat Feb 6 22:52:15 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12072391 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9C466C433E0 for ; Sat, 6 Feb 2021 22:53:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 54E4664E34 for ; Sat, 6 Feb 2021 22:53:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229618AbhBFWxD (ORCPT ); Sat, 6 Feb 2021 17:53:03 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44038 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229537AbhBFWxB (ORCPT ); Sat, 6 Feb 2021 17:53:01 -0500 Received: from mail-wr1-x42c.google.com (mail-wr1-x42c.google.com [IPv6:2a00:1450:4864:20::42c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 85B40C061756 for ; Sat, 6 Feb 2021 14:52:21 -0800 (PST) Received: by mail-wr1-x42c.google.com with SMTP id b3so12515823wrj.5 for ; Sat, 06 Feb 2021 14:52:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=4Z3xowqJGcAVTcCt9pLILej9TBFmNpMdDHO/l+AiqLI=; b=kkwlsMFnSSROu0l4a3ixbPAii/W3SB3R7aSrjXV0K+A1aekMlvrzZNvHwSNNrsSNaf Ir4woqZkuiXUsqnpPkd+GWx8m28kRYSbPGhFY6W1vWXF2a2xsPWLP/XlI2+3ba0TP7kH PgZMhFdy9VwA2pJ8fiEWSfhl2h1e1RhgX+E/pg8/801T5VZxyjia4sxRlCTVC8KAZtyT HABbcmPC+xQ+5FO7NJOogGeJwiVO+R3lJy+NA2hXVgNH15y4JNyppJcZTOrsgpXza0jJ 0ZjGg+ulF73uYKcvs7KgwJHZWL0daYNcDcSeFN6+zKp2hkdpkqtN4UE+lLovZV0OMolv DGkQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=4Z3xowqJGcAVTcCt9pLILej9TBFmNpMdDHO/l+AiqLI=; b=CzaauW/2OCMl86jHkMe39umPxVHtYjtUUGvr0k7XwZtH1MM3AWntIpxYLH4YJ1J4uI Mg1rF3gCfWs8RRS1tPkq8a/K2efRuvMEtXM6CP/mEQHxWL1Huezumkfp5n2oCfgh6Ggk MhWSL+56KBNWb7Hss380Ysk6klFn2GiPqFrlORvwnGbD6KVTKKaec/GtKDIgmByhtB02 3oZ8YwAUGFTeC2XImggChZKNKiB5r2+I17UxI8C5/Yyy1ibk9FO0FcH/2ezrWCqwL7gv Ey3p02gswmB1C5AvxCK6a2PZL2nndYP5V2AL3rsSsMnDZJDCWW06k/flFYzDmQaIYYJd UPSg== X-Gm-Message-State: AOAM533uZwE0oJAUCoJAHZxUjepREKS0RYx2pd8ytwxNOIQKuKRnr7W3 xpGhgkRqsXtRcZu2M1DniZLJy8NF1CM= X-Google-Smtp-Source: ABdhPJwYudP8xa5+Mpttv5cv4vDtEtZTYrpXlRaDDsYDVFX+MRe+oa3TaaXP9mlSngiFoiQnZjZqFg== X-Received: by 2002:adf:decf:: with SMTP id i15mr12126047wrn.405.1612651940121; Sat, 06 Feb 2021 14:52:20 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id i18sm19090110wrn.29.2021.02.06.14.52.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 06 Feb 2021 14:52:19 -0800 (PST) Message-Id: <377a4a39fa864a09eb1d90e1addcc5d61d6b026c.1612651937.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Sat, 06 Feb 2021 22:52:15 +0000 Subject: [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren We want to make use of unique basenames to help inform rename detection, so that more likely pairings can be checked first. Add a new function, not yet used, which creates a map of the unique basenames within rename_src and another within rename_dst, together with the indices within rename_src/rename_dst where those basenames show up. Non-unique basenames still show up in the map, but have an invalid index (-1). Signed-off-by: Elijah Newren --- diffcore-rename.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 53 insertions(+) diff --git a/diffcore-rename.c b/diffcore-rename.c index 74930716e70d..1c52077b04e5 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -367,6 +367,59 @@ static int find_exact_renames(struct diff_options *options) return renames; } +MAYBE_UNUSED +static int find_basename_matches(struct diff_options *options, + int minimum_score, + int num_src) +{ + int i; + struct strintmap sources; + struct strintmap dests; + + /* Create maps of basename -> fullname(s) for sources and dests */ + strintmap_init_with_options(&sources, -1, NULL, 0); + strintmap_init_with_options(&dests, -1, NULL, 0); + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; + char *base; + + /* exact renames removed in remove_unneeded_paths_from_src() */ + assert(!rename_src[i].p->one->rename_used); + + base = strrchr(filename, '/'); + base = (base ? base+1 : filename); + + /* Record index within rename_src (i) if basename is unique */ + if (strintmap_contains(&sources, base)) + strintmap_set(&sources, base, -1); + else + strintmap_set(&sources, base, i); + } + for (i = 0; i < rename_dst_nr; ++i) { + char *filename = rename_dst[i].p->two->path; + char *base; + + if (rename_dst[i].is_rename) + continue; /* involved in exact match already. */ + + base = strrchr(filename, '/'); + base = (base ? base+1 : filename); + + /* Record index within rename_dst (i) if basename is unique */ + if (strintmap_contains(&dests, base)) + strintmap_set(&dests, base, -1); + else + strintmap_set(&dests, base, i); + } + + /* TODO: Make use of basenames source and destination basenames */ + + strintmap_clear(&sources); + strintmap_clear(&dests); + + return 0; +} + #define NUM_CANDIDATE_PER_DST 4 static void record_if_better(struct diff_score m[], struct diff_score *o) { From patchwork Sat Feb 6 22:52:16 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12072395 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B789CC433DB for ; Sat, 6 Feb 2021 22:53:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7BC6D64E34 for ; Sat, 6 Feb 2021 22:53:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229650AbhBFWxG (ORCPT ); Sat, 6 Feb 2021 17:53:06 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44056 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229629AbhBFWxF (ORCPT ); Sat, 6 Feb 2021 17:53:05 -0500 Received: from mail-wm1-x32e.google.com (mail-wm1-x32e.google.com [IPv6:2a00:1450:4864:20::32e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E6340C061788 for ; Sat, 6 Feb 2021 14:52:24 -0800 (PST) Received: by mail-wm1-x32e.google.com with SMTP id i5so1160269wmq.2 for ; Sat, 06 Feb 2021 14:52:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=xVW8AYkO+CVDw+SAvP5by0Z6h9oLqD7qr93LMHFcLQk=; b=MmnK4QPrwlnBG4QQeOjYqYSwTMj/2jE61DbuMyZTa3Q0elqlcVdgFx5Fk8ZjxIWZfp 3WY5jpmZzoaoH8p4Q8KSmuU0j14S/7XpIj1mx8BdVziOQf/ENV+sNAqT33gJJhtbKHWb F86kG54NSywpoYn0P5aihAErqHDuVffh6aRPIhUuDpU2EUFxTy2yJyxSal8yjjtQGUZ/ JbKN7+DQnViwep7NxYcHqHK2xBS/wxQZZE8YSuoqMdCjzI04cTEzt2R8roeDwgWva0PT a70sJ+QniwtY/eWy0dm3EpwBvkpipNP2CaUfLBAfXdf+yoQ9HjteGS7n7Xxs6zo23I4N pC7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=xVW8AYkO+CVDw+SAvP5by0Z6h9oLqD7qr93LMHFcLQk=; b=i1wuo+U72ultJl/RaNWvF1g3oDwMsgaHdE4FkCfg/xLcKAQHEvGXjU5/srRZyS1575 ePONRy2XB1Cb0ry3dDz97nCmBSntIw6lv04Sn34SA2tsUldSCztdqU2iqKBpmgUXR4gW kYwjWPaG9wz0RED9kMrc89JCWsn9IYeMhBPx5wwC305CWkfHCfzU1CG/291Q4wsREGCo hCue6UGuHKBJ28wfpN9kl4EQ5pc9KMdkxeIFFk2pSTeWAg7N7g6IAT5/Ilh3NYS8flHV oIqLKgkvCxgQONNu+66VbjVsoN56MDr2X19b4bNOLZXeb4PE+ja2kFfUFqfPQ2zFy0+D dXrw== X-Gm-Message-State: AOAM531Y3cC4sbsOgPnD8oCFIJhHY2SyEltPWd4pRIILU1Bs10uuLO4/ HuyQilAZE4RpyXL1cskO24jlqIPutQU= X-Google-Smtp-Source: ABdhPJzLfevJSJkHnfvNeeBqWnYoGIXJMe0l1DAMiLn+0imAiGon3nKyjkErWkAWxuxbUYO6TnuZeg== X-Received: by 2002:a1c:6208:: with SMTP id w8mr8730829wmb.99.1612651941084; Sat, 06 Feb 2021 14:52:21 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id t7sm3128564wrv.75.2021.02.06.14.52.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 06 Feb 2021 14:52:20 -0800 (PST) Message-Id: <5fb4493247ff4676abd11eb47b1db3111e043fb5.1612651937.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Sat, 06 Feb 2021 22:52:16 +0000 Subject: [PATCH 2/3] diffcore-rename: complete find_basename_matches() Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren It is not uncommon in real world repositories for the majority of file renames to not change the basename of the file; i.e. most "renames" are just a move of files into different directories. We can make use of this to avoid comparing all rename source candidates with all rename destination candidates, by first comparing sources to destinations with the same basenames. If two files with the same basename are sufficiently similar, we record the rename; if not, we include those files in the more exhaustive matrix comparison. Note that this optimization might give us different results than without the optimization, because it's possible that despite files with the same basename being sufficiently similar to be considered a rename, there's an even better match between files without the same basename. I think that is okay for four reasons: (1) That seems somewhat unlikely in practice, (2) it's easy to explain to the users what happened if it does ever occur (or even for them to intuitively figure out), and (3) as the next patch will show it provides such a large performance boost that it's worth the tradeoff. Reason (4) takes a full paragraph to explain... If the previous three reasons aren't enough, consider what rename detection already does. Break detection is not the default, meaning that if files have the same _fullname_, then they are considered related even if they are 0% similar. In fact, in such a case, we don't even bother comparing the files to see if they are similar let alone comparing them to all other files to see what they are most similar to. Basically, we override content similarity based on sufficient filename similarity. Without the filename similarity (currently implemented as an exact match of filename), we swing the pendulum the opposite direction and say that filename similarity is irrelevant and compare a full N x M matrix of sources and destinations to find out which have the most similar contents. This optimization just adds another form of filename similarity comparison, but augments it with a file content similarity check as well. Basically, if two files have the same basename and are sufficiently similar to be considered a rename, mark them as such without comparing the two to all other rename candidates. We do not use this heuristic together with either break or copy detection. The point of break detection is to say that filename similarity does not imply file content similarity, and we only want to know about file content similarity. The point of copy detection is to use more resources to check for additional similarities, while this is an optimization that uses far less resources but which might also result in finding slightly fewer similarities. So the idea behind this optimization goes against both of those features, and will be turned off for both. Note that this optimization is not yet effective in any situation, as the function is still unused. The next commit will hook it into the code so that it is used when rename detection is wanted, but neither copy nor break detection are. Signed-off-by: Elijah Newren --- diffcore-rename.c | 94 +++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 91 insertions(+), 3 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 1c52077b04e5..b1dda41de9b1 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -372,10 +372,48 @@ static int find_basename_matches(struct diff_options *options, int minimum_score, int num_src) { - int i; + /* + * When I checked, over 76% of file renames in linux just moved + * files to a different directory but kept the same basename. gcc + * did that with over 64% of renames, gecko did it with over 79%, + * and WebKit did it with over 89%. + * + * Therefore we can bypass the normal exhaustive NxM matrix + * comparison of similarities between all potential rename sources + * and destinations by instead using file basename as a hint, checking + * for similarity between files with the same basename, and if we + * find a pair that are sufficiently similar, record the rename + * pair and exclude those two from the NxM matrix. + * + * This *might* cause us to find a less than optimal pairing (if + * there is another file that we are even more similar to but has a + * different basename). Given the huge performance advantage + * basename matching provides, and given the frequency with which + * people use the same basename in real world projects, that's a + * trade-off we are willing to accept when doing just rename + * detection. However, if someone wants copy detection that + * implies they are willing to spend more cycles to find + * similarities between files, so it may be less likely that this + * heuristic is wanted. + */ + + int i, renames = 0; struct strintmap sources; struct strintmap dests; + /* + * The prefeteching stuff wants to know if it can skip prefetching blobs + * that are unmodified. unmodified blobs are only relevant when doing + * copy detection. find_basename_matches() is only used when detecting + * renames, not when detecting copies, so it'll only be used when a file + * only existed in the source. Since we already know that the file + * won't be unmodified, there's no point checking for it; that's just a + * waste of resources. So set skip_unmodified to 0 so that + * estimate_similarity() and prefetch() won't waste resources checking + * for something we already know is false. + */ + int skip_unmodified = 0; + /* Create maps of basename -> fullname(s) for sources and dests */ strintmap_init_with_options(&sources, -1, NULL, 0); strintmap_init_with_options(&dests, -1, NULL, 0); @@ -412,12 +450,62 @@ static int find_basename_matches(struct diff_options *options, strintmap_set(&dests, base, i); } - /* TODO: Make use of basenames source and destination basenames */ + /* Now look for basename matchups and do similarity estimation */ + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; + char *base = NULL; + intptr_t src_index; + intptr_t dst_index; + + /* Get the basename */ + base = strrchr(filename, '/'); + base = (base ? base+1 : filename); + + /* Find out if this basename is unique among sources */ + src_index = strintmap_get(&sources, base); + if (src_index == -1) + continue; /* not a unique basename; skip it */ + assert(src_index == i); + + if (strintmap_contains(&dests, base)) { + struct diff_filespec *one, *two; + int score; + + /* Find out if this basename is unique among dests */ + dst_index = strintmap_get(&dests, base); + if (dst_index == -1) + continue; /* not a unique basename; skip it */ + + /* Ignore this dest if already used in a rename */ + if (rename_dst[dst_index].is_rename) + continue; /* already used previously */ + + /* Estimate the similarity */ + one = rename_src[src_index].p->one; + two = rename_dst[dst_index].p->two; + score = estimate_similarity(options->repo, one, two, + minimum_score, skip_unmodified); + + /* If sufficiently similar, record as rename pair */ + if (score < minimum_score) + continue; + record_rename_pair(dst_index, src_index, score); + renames++; + + /* + * Found a rename so don't need text anymore; if we + * didn't find a rename, the filespec_blob would get + * re-used when doing the matrix of comparisons. + */ + diff_free_filespec_blob(one); + diff_free_filespec_blob(two); + } + } strintmap_clear(&sources); strintmap_clear(&dests); - return 0; + return renames; } #define NUM_CANDIDATE_PER_DST 4 From patchwork Sat Feb 6 22:52:17 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12072397 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4DEC8C433E0 for ; Sat, 6 Feb 2021 22:53:31 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1B0B464E27 for ; Sat, 6 Feb 2021 22:53:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229646AbhBFWxF (ORCPT ); Sat, 6 Feb 2021 17:53:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44048 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229601AbhBFWxD (ORCPT ); Sat, 6 Feb 2021 17:53:03 -0500 Received: from mail-wr1-x436.google.com (mail-wr1-x436.google.com [IPv6:2a00:1450:4864:20::436]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 28F50C061786 for ; Sat, 6 Feb 2021 14:52:23 -0800 (PST) Received: by mail-wr1-x436.google.com with SMTP id c12so12497135wrc.7 for ; Sat, 06 Feb 2021 14:52:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=PGRjQzl9BkG3yv+jV8uKDjIRk9MizOHpwOuVccr+YAI=; b=tHYNuIyjJU8lc7ySo+UnCWfsJtjY8+kgKarKX8NO8RfS2fPtJ5eiBWztIQ7i5uoq+n tJaucUdYfp5jb9+RLIvcVqAmUeCfEAFe52637wIff/m+42pys9O8cCDxNsbs4gx9G0M8 5XhG4nlNGVo9SOL6WlmFO0NEI53pkMthmYsei0KhwBVezcaqMsC9dy6aB0iE2CbkTbMV 60FLZoZ++5JlviUc1xptknda3Avku4hTZ7pUOuAuMV97DY5o87c4d5kzRhlI9IGU99uG M3+tjfKuTGU+0ck7kC/BH5lFR3YiLUF2PeQVeDej7rwxsDE9FEqK7uv74YbEaZL7e5rR tZsg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=PGRjQzl9BkG3yv+jV8uKDjIRk9MizOHpwOuVccr+YAI=; b=EfNYWdUVgRqaVosH/C8xuDD4bWLiCv326qb/PvQnS9ECgO/GK9NcuPX9yiujRutzD+ 3WczcfK7h2+i0t7McU2YffPM3OhQv9lC5Vaey7NS/dInIQGG9q+TWqSJLBtz+KOp+P02 Z0IsSUlTn7yLlDgppgyKIvl9veWn1GalN1NqjaXovC/iOMMiJQ3CsXfxQ9bCif9iWxcw QiaeY0zg/v9ZsHHjw0KRVzhatL77NUdUucMllFImkK0jBg0kNIDySJN4mB+uEw1YHYrG vg3/svZiSW2V/DQ/mFpuXGLrgYE7bCywTZGcdTlLZmlqYUXlK930dAjIDic+w9jslVf1 WtEg== X-Gm-Message-State: AOAM5317MKJ9okS5/WrGjPtLKuGRvMO5PKGKJiktxLAfboFiIXRJPl2j 8GMdC2uQ+m4+Exx+/7kf33TrVNmhboQ= X-Google-Smtp-Source: ABdhPJxMO8EA465NwAwbmW1FnpI5fx2fp/sj5msMdT7FwJd5p57EsyLUWdwtm4TovILfA+YYG9pwPg== X-Received: by 2002:adf:f2d1:: with SMTP id d17mr12460003wrp.110.1612651941770; Sat, 06 Feb 2021 14:52:21 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id l10sm18196827wro.4.2021.02.06.14.52.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 06 Feb 2021 14:52:21 -0800 (PST) Message-Id: <1d941c35076e8d515c8ff7ef01d6b9d8c092aaa9.1612651937.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Sat, 06 Feb 2021 22:52:17 +0000 Subject: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren Make use of the new find_basename_matches() function added in the last two patches, to find renames more rapidly in cases where we can match up files based on basenames. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 13.815 s ± 0.062 s 13.138 s ± 0.086 s mega-renames: 1799.937 s ± 0.493 s 169.488 s ± 0.494 s just-one-mega: 51.289 s ± 0.019 s 5.061 s ± 0.017 s Signed-off-by: Elijah Newren --- diffcore-rename.c | 42 +++++++++++++++++++++++++++++++++++++----- 1 file changed, 37 insertions(+), 5 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index b1dda41de9b1..206c0bbdcdfb 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -367,7 +367,6 @@ static int find_exact_renames(struct diff_options *options) return renames; } -MAYBE_UNUSED static int find_basename_matches(struct diff_options *options, int minimum_score, int num_src) @@ -718,12 +717,45 @@ void diffcore_rename(struct diff_options *options) if (minimum_score == MAX_SCORE) goto cleanup; + num_sources = rename_src_nr; + + if (want_copies || break_idx) { + /* + * Cull sources: + * - remove ones corresponding to exact renames + */ + trace2_region_enter("diff", "cull after exact", options->repo); + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull after exact", options->repo); + } else { + /* + * Cull sources: + * - remove ones involved in renames (found via exact match) + */ + trace2_region_enter("diff", "cull exact", options->repo); + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull exact", options->repo); + + /* Utilize file basenames to quickly find renames. */ + trace2_region_enter("diff", "basename matches", options->repo); + rename_count += find_basename_matches(options, minimum_score, + rename_src_nr); + trace2_region_leave("diff", "basename matches", options->repo); + + /* + * Cull sources, again: + * - remove ones involved in renames (found via basenames) + */ + trace2_region_enter("diff", "cull basename", options->repo); + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull basename", options->repo); + } + /* - * Calculate how many renames are left + * Calculate how many rename destinations are left */ num_destinations = (rename_dst_nr - rename_count); - remove_unneeded_paths_from_src(want_copies); - num_sources = rename_src_nr; + num_sources = rename_src_nr; /* rename_src_nr reflects lower number */ /* All done? */ if (!num_destinations || !num_sources) @@ -755,7 +787,7 @@ void diffcore_rename(struct diff_options *options) struct diff_score *m; if (rename_dst[i].is_rename) - continue; /* dealt with exact match already. */ + continue; /* exact or basename match already handled */ m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST]; for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)