From patchwork Tue Feb 9 11:32:03 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12077851 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 30021C433DB for ; Tue, 9 Feb 2021 11:43:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CF69B64E10 for ; Tue, 9 Feb 2021 11:43:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230382AbhBILmv (ORCPT ); Tue, 9 Feb 2021 06:42:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60460 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230256AbhBILhH (ORCPT ); Tue, 9 Feb 2021 06:37:07 -0500 Received: from mail-wm1-x335.google.com (mail-wm1-x335.google.com [IPv6:2a00:1450:4864:20::335]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9F788C0617A9 for ; Tue, 9 Feb 2021 03:32:10 -0800 (PST) Received: by mail-wm1-x335.google.com with SMTP id t142so2772118wmt.1 for ; Tue, 09 Feb 2021 03:32:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=kJC18C9UjNdFU+IHHbkveIfCe/0q0qPAvaWxxOluuho=; b=dh7dyT0o9QALfJ/6YXVwlIhlPjUmlLw/2Aoc83o9XHcZt02ca2yFc3YhDgerPBAyE0 Rwmq38Y7HaUBQKwgAKKcuqhSy2cOKpPo3WVSB3xTuQukxs89CufoO2Mzk0aoEiVPxelL eLEwDCWiCwc8UQWXKJp6GpoAQmop0a2669BIVduOrrR8xnsD01GvtcRTPK+XZfb6k7Jf uBhOdxS8U/3YB01MLC0DV05sHeckL2fDCoUDWgGW0cO+tJ3+B/b4Q0VWYt6r+LTyG8Ol f+hx94z/Mh2qMK6wH6DMn+6m7ymwghiMPZ2fiNibIKyupmQsN2Pb+LgheP3zecbr6jY2 u5+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=kJC18C9UjNdFU+IHHbkveIfCe/0q0qPAvaWxxOluuho=; b=qFot5CRwwytjiiX5MGHhxALnn24RDgqpKR1uW4EfCoQm7oXa7TR7x51VXHuFh6fcT6 gKyKyvYNB7xdXOrDX3Ursf6UztONnqpD+G1Z18U4qKvaseEmRLGiHlm+Tjg7sfUsM7Yh ExPbvgq3NoDMcyCNpWSzb8Z04TT8Pd+s5Jh21vtjBoS1lPl3j50voVXmT0QfAu9HNHqZ YYPXZigVzaXM/v2xZBEMbPnRSJp3rc/sHel6XgwuqYMlHWAT6aLDTQmoY4HeLs3lLNsb h8R9Mn19i2AGq6EKiJS6yuqjLtAg0iHaQCxekQAzIBPwzA2HIu6tQ4CCd5szBNuGsgY1 bgdw== X-Gm-Message-State: AOAM5311K38j6nM7lNgq5uGXVCRtOD+XgJqD0+q8DIDG7rHRvGYdr7JZ nbwAPLaXQxJcb3FEMZAzZ9xi0ozmnKs= X-Google-Smtp-Source: ABdhPJzl6ObhBwA7ycfztB/+vjIyEEaCH7hVTd/yUt1RSqPZKvwstqcWXTQDlWxrBEvLNGcqgNF2TA== X-Received: by 2002:a7b:cb05:: with SMTP id u5mr2982843wmj.140.1612870329155; Tue, 09 Feb 2021 03:32:09 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id f2sm2334909wrt.7.2021.02.09.03.32.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 09 Feb 2021 03:32:08 -0800 (PST) Message-Id: <381a45d239bb52a70373c385d8978005c9cb4800.1612870326.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Tue, 09 Feb 2021 11:32:03 +0000 Subject: [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren We want to make use of unique basenames to help inform rename detection, so that more likely pairings can be checked first. (src/moduleA/foo.txt and source/module/A/foo.txt are likely related if there are no other 'foo.txt' files among the deleted and added files.) Add a new function, not yet used, which creates a map of the unique basenames within rename_src and another within rename_dst, together with the indices within rename_src/rename_dst where those basenames show up. Non-unique basenames still show up in the map, but have an invalid index (-1). This function was inspired by the fact that in real world repositories, most renames often do not involve a basename change. Here are some sample repositories and the percentage of their historical renames (as of early 2020) that did not involve a basename change: * linux: 76% * gcc: 64% * gecko: 79% * webkit: 89% Signed-off-by: Elijah Newren --- diffcore-rename.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 53 insertions(+) diff --git a/diffcore-rename.c b/diffcore-rename.c index 74930716e70d..1c52077b04e5 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -367,6 +367,59 @@ static int find_exact_renames(struct diff_options *options) return renames; } +MAYBE_UNUSED +static int find_basename_matches(struct diff_options *options, + int minimum_score, + int num_src) +{ + int i; + struct strintmap sources; + struct strintmap dests; + + /* Create maps of basename -> fullname(s) for sources and dests */ + strintmap_init_with_options(&sources, -1, NULL, 0); + strintmap_init_with_options(&dests, -1, NULL, 0); + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; + char *base; + + /* exact renames removed in remove_unneeded_paths_from_src() */ + assert(!rename_src[i].p->one->rename_used); + + base = strrchr(filename, '/'); + base = (base ? base+1 : filename); + + /* Record index within rename_src (i) if basename is unique */ + if (strintmap_contains(&sources, base)) + strintmap_set(&sources, base, -1); + else + strintmap_set(&sources, base, i); + } + for (i = 0; i < rename_dst_nr; ++i) { + char *filename = rename_dst[i].p->two->path; + char *base; + + if (rename_dst[i].is_rename) + continue; /* involved in exact match already. */ + + base = strrchr(filename, '/'); + base = (base ? base+1 : filename); + + /* Record index within rename_dst (i) if basename is unique */ + if (strintmap_contains(&dests, base)) + strintmap_set(&dests, base, -1); + else + strintmap_set(&dests, base, i); + } + + /* TODO: Make use of basenames source and destination basenames */ + + strintmap_clear(&sources); + strintmap_clear(&dests); + + return 0; +} + #define NUM_CANDIDATE_PER_DST 4 static void record_if_better(struct diff_score m[], struct diff_score *o) { From patchwork Tue Feb 9 11:32:04 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12077847 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4F935C433DB for ; Tue, 9 Feb 2021 11:42:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0CDAF64E10 for ; Tue, 9 Feb 2021 11:42:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230340AbhBILmd (ORCPT ); Tue, 9 Feb 2021 06:42:33 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60462 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230229AbhBILfw (ORCPT ); Tue, 9 Feb 2021 06:35:52 -0500 Received: from mail-wr1-x430.google.com (mail-wr1-x430.google.com [IPv6:2a00:1450:4864:20::430]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CBC90C0617AA for ; Tue, 9 Feb 2021 03:32:11 -0800 (PST) Received: by mail-wr1-x430.google.com with SMTP id n6so8347391wrv.8 for ; Tue, 09 Feb 2021 03:32:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=86fSnceyhV/8DTgGkIT35jFDjF2lBabCuua7RBFVi2U=; b=tYsQUs6mHwwakkarqysmPvjLVis1MqQuSblPmM0Gl5a7Xs5t6kR/r8zU0pLg8L1wwa FOTfypSvhQw3erkMQUzW9bjQU04By1dZTxesuTcW4i+HiEIYGibJX7oGl1zv29ekzPMT xUgLJcnAtaXkvh6DdL81AP18+qGM+ZFq+s0EFLPHTT40V1naMBHDOdrDd7PY4RQwyVc5 FdNg3cItcGn2YccDfBdfdsvohPahZ7QW4aKBayieo5oFYOdxv4yXp4Yz24g90dxr9t5Q vHN54rVQoAqIiOUWHEUZxCerTQkfViIVwGIrC1yyO/2RGJADPveMJAJfOHB5pW2htG8q H0PQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=86fSnceyhV/8DTgGkIT35jFDjF2lBabCuua7RBFVi2U=; b=b2b7leDcvpJXnIxWnqhyuOrxTpQs3/zDTPEybPuzo1BKmgVWB38sRGppWvkElBEbHQ e2Opl6slmiMYuVpZN275ZWt8lx/SkzX9Nd9i5CH2XkbnfV4KGzmFdTIVXO9G3hAvLJqo 76jHdZSApOsw37i7kX36NppHQJAsxc4h31j85GMzwoO7NbVJ3Bsijeo+1RL3dsG4A3mX cKWh0P3kMw5wyFP+mz5AzW77tV4TLXMTRdpY8k4rkhe3/k15F8UkYy2yLFTTayMTItbJ Gd/cIHpM9a/B8NJbRxb45kkjVOqN8QFqVZ/CcBBFJiTIgg5hP4IkVCiqXwTbHU9agGKw GqzA== X-Gm-Message-State: AOAM530kiTmdU2xajPy/NZ/xkvdg0DUQHCdN0sHkoBr6OsPZ3tjcB0kg pzX/Zz+xGeRQnUjouOJRVgOlZFf562k= X-Google-Smtp-Source: ABdhPJwyUszkZkRay45lqkDYrb8EIW/EGkN5UpHHk/pMU4Li8pnrl7PTmSoG0lUL1U9ZHWoJhNE82g== X-Received: by 2002:a5d:58e7:: with SMTP id f7mr2131422wrd.114.1612870330288; Tue, 09 Feb 2021 03:32:10 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id o13sm4138086wmh.2.2021.02.09.03.32.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 09 Feb 2021 03:32:09 -0800 (PST) Message-Id: In-Reply-To: References: Date: Tue, 09 Feb 2021 11:32:04 +0000 Subject: [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren It is not uncommon in real world repositories for the majority of file renames to not change the basename of the file; i.e. most "renames" are just a move of files into different directories. We can make use of this to avoid comparing all rename source candidates with all rename destination candidates, by first comparing sources to destinations with the same basenames. If two files with the same basename are sufficiently similar, we record the rename; if not, we include those files in the more exhaustive matrix comparison. This means we are adding a set of preliminary additional comparisons, but for each file we only compare it with at most one other file. For example, if there was a include/media/device.h that was deleted and a src/module/media/device.h that was added, and there were no other device.h files added or deleted between the commits being compared, then these two files would be compared in the preliminary step. This commit does not yet actually employ this new optimization, it merely adds a function which can be used for this purpose. The next commit will do the necessary plumbing to make use of it. Note that this optimization might give us different results than without the optimization, because it's possible that despite files with the same basename being sufficiently similar to be considered a rename, there's an even better match between files without the same basename. I think that is okay for four reasons: (1) it's easy to explain to the users what happened if it does ever occur (or even for them to intuitively figure out), (2) as the next patch will show it provides such a large performance boost that it's worth the tradeoff, and (3) it's somewhat unlikely that despite having unique matching basenames that other files serve as better matches. Reason (4) takes a full paragraph to explain... If the previous three reasons aren't enough, consider what rename detection already does. Break detection is not the default, meaning that if files have the same _fullname_, then they are considered related even if they are 0% similar. In fact, in such a case, we don't even bother comparing the files to see if they are similar let alone comparing them to all other files to see what they are most similar to. Basically, we override content similarity based on sufficient filename similarity. Without the filename similarity (currently implemented as an exact match of filename), we swing the pendulum the opposite direction and say that filename similarity is irrelevant and compare a full N x M matrix of sources and destinations to find out which have the most similar contents. This optimization just adds another form of filename similarity comparison, but augments it with a file content similarity check as well. Basically, if two files have the same basename and are sufficiently similar to be considered a rename, mark them as such without comparing the two to all other rename candidates. Signed-off-by: Elijah Newren --- diffcore-rename.c | 94 +++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 91 insertions(+), 3 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 1c52077b04e5..b1dda41de9b1 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -372,10 +372,48 @@ static int find_basename_matches(struct diff_options *options, int minimum_score, int num_src) { - int i; + /* + * When I checked, over 76% of file renames in linux just moved + * files to a different directory but kept the same basename. gcc + * did that with over 64% of renames, gecko did it with over 79%, + * and WebKit did it with over 89%. + * + * Therefore we can bypass the normal exhaustive NxM matrix + * comparison of similarities between all potential rename sources + * and destinations by instead using file basename as a hint, checking + * for similarity between files with the same basename, and if we + * find a pair that are sufficiently similar, record the rename + * pair and exclude those two from the NxM matrix. + * + * This *might* cause us to find a less than optimal pairing (if + * there is another file that we are even more similar to but has a + * different basename). Given the huge performance advantage + * basename matching provides, and given the frequency with which + * people use the same basename in real world projects, that's a + * trade-off we are willing to accept when doing just rename + * detection. However, if someone wants copy detection that + * implies they are willing to spend more cycles to find + * similarities between files, so it may be less likely that this + * heuristic is wanted. + */ + + int i, renames = 0; struct strintmap sources; struct strintmap dests; + /* + * The prefeteching stuff wants to know if it can skip prefetching blobs + * that are unmodified. unmodified blobs are only relevant when doing + * copy detection. find_basename_matches() is only used when detecting + * renames, not when detecting copies, so it'll only be used when a file + * only existed in the source. Since we already know that the file + * won't be unmodified, there's no point checking for it; that's just a + * waste of resources. So set skip_unmodified to 0 so that + * estimate_similarity() and prefetch() won't waste resources checking + * for something we already know is false. + */ + int skip_unmodified = 0; + /* Create maps of basename -> fullname(s) for sources and dests */ strintmap_init_with_options(&sources, -1, NULL, 0); strintmap_init_with_options(&dests, -1, NULL, 0); @@ -412,12 +450,62 @@ static int find_basename_matches(struct diff_options *options, strintmap_set(&dests, base, i); } - /* TODO: Make use of basenames source and destination basenames */ + /* Now look for basename matchups and do similarity estimation */ + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; + char *base = NULL; + intptr_t src_index; + intptr_t dst_index; + + /* Get the basename */ + base = strrchr(filename, '/'); + base = (base ? base+1 : filename); + + /* Find out if this basename is unique among sources */ + src_index = strintmap_get(&sources, base); + if (src_index == -1) + continue; /* not a unique basename; skip it */ + assert(src_index == i); + + if (strintmap_contains(&dests, base)) { + struct diff_filespec *one, *two; + int score; + + /* Find out if this basename is unique among dests */ + dst_index = strintmap_get(&dests, base); + if (dst_index == -1) + continue; /* not a unique basename; skip it */ + + /* Ignore this dest if already used in a rename */ + if (rename_dst[dst_index].is_rename) + continue; /* already used previously */ + + /* Estimate the similarity */ + one = rename_src[src_index].p->one; + two = rename_dst[dst_index].p->two; + score = estimate_similarity(options->repo, one, two, + minimum_score, skip_unmodified); + + /* If sufficiently similar, record as rename pair */ + if (score < minimum_score) + continue; + record_rename_pair(dst_index, src_index, score); + renames++; + + /* + * Found a rename so don't need text anymore; if we + * didn't find a rename, the filespec_blob would get + * re-used when doing the matrix of comparisons. + */ + diff_free_filespec_blob(one); + diff_free_filespec_blob(two); + } + } strintmap_clear(&sources); strintmap_clear(&dests); - return 0; + return renames; } #define NUM_CANDIDATE_PER_DST 4 From patchwork Tue Feb 9 11:32:05 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12077849 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 995A4C433E0 for ; Tue, 9 Feb 2021 11:42:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 50AEB64E77 for ; Tue, 9 Feb 2021 11:42:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230348AbhBILmn (ORCPT ); Tue, 9 Feb 2021 06:42:43 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60464 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230232AbhBILfx (ORCPT ); Tue, 9 Feb 2021 06:35:53 -0500 Received: from mail-wm1-x335.google.com (mail-wm1-x335.google.com [IPv6:2a00:1450:4864:20::335]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id ED5C9C0617AB for ; Tue, 9 Feb 2021 03:32:12 -0800 (PST) Received: by mail-wm1-x335.google.com with SMTP id o10so1523278wmc.1 for ; Tue, 09 Feb 2021 03:32:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=mGWVDkeWbHVzoF1Sb+r70QO9VawowXIwV5HfLkCAptw=; b=IYj25GztWu9jfJ4pFW90D4SpFr7MfQg6zddJz7/PzLQfS9LsHKs+l5nMqrup0KmtG0 PHZ77QhZZH8sLgxTM1GxS3U7TDH2I3jgnzSO0aHXLgrrgdn4sYcCLDyvf6phSIW20v1G p6OMbszUvqAiaiP67uSYOJMGNPjEYR6FFmSxCBtfcLjUtrzD3xhXS9V/KqquPe+g8gT+ a/fVLqA6HiJE5FEk7ZTHVSELAcTLbMFrxcb9Hf4Lwped8nMEnu5ISkrC6ckabbfi1EvW sWrALRWnQSSF0aWmngqIqVeG/8FhKTuUuWTf3DUvikAJX/8cuXtBYvoOUh2lp5lIgK5t iPhQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=mGWVDkeWbHVzoF1Sb+r70QO9VawowXIwV5HfLkCAptw=; b=Z5Wwi2Bh85d5RYXSuRJNbgXkuYkOOuCvjRT2PejRmJSwttO/YAaFH3OovOwLdjTyZo tlUUD3Hza3LqwrOjgPWA6oFY5XOfZoelRdrdr4KjOhWijoYw0LLKGS12IHr6YNDD87jg MPCAnOF0TaTPossfLN/fTTyoVZZaZFRJNCDmw9DuvnYM2uaKBYpdo8EEYE0Nuk+zJ6q9 8OM8q6OAUel9IYmLCdvZUK3x6GhQXU69mMMEwOG8YvKn7jPLexSS/r8VKKpQumJrc64U VgfJQ2vTilW7RE5R5j2CaMBK/9Lx867HXh6X5fcY3XQrt4gkBPCbHH01HAuDglrxk5/U Xokw== X-Gm-Message-State: AOAM530I68wOorLfqFy5W6++8/2zxM6MOmv0tyh7vK90vb1Zbop4dI1z lWWM8DnudxhiNw5SsOk/iTramkqZ0IE= X-Google-Smtp-Source: ABdhPJwofCG4kO722iW4g9uOPscklE9uhQb/nCrWZs1MyMhHy4+lwPOTsgBJfmrlok2TsIteOZQCRA== X-Received: by 2002:a1c:40d4:: with SMTP id n203mr3040966wma.46.1612870331426; Tue, 09 Feb 2021 03:32:11 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id c9sm39062823wrw.76.2021.02.09.03.32.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 09 Feb 2021 03:32:10 -0800 (PST) Message-Id: In-Reply-To: References: Date: Tue, 09 Feb 2021 11:32:05 +0000 Subject: [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren Make use of the new find_basename_matches() function added in the last two patches, to find renames more rapidly in cases where we can match up files based on basenames. As a quick reminder (see the last two commit messages for more details), this means for example that docs/extensions.txt and docs/config/extensions.txt are considered likely renames if there are no 'extensions.txt' files elsewhere among the added and deleted files, and if a similarity check confirms they are similar, then they are marked as a rename without looking for a better similarity match among other files. This is a behavioral change, as covered in more detail in the previous commit message. We do not use this heuristic together with either break or copy detection. The point of break detection is to say that filename similarity does not imply file content similarity, and we only want to know about file content similarity. The point of copy detection is to use more resources to check for additional similarities, while this is an optimization that uses far less resources but which might also result in finding slightly fewer similarities. So the idea behind this optimization goes against both of those features, and will be turned off for both. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 13.815 s ± 0.062 s 13.138 s ± 0.086 s mega-renames: 1799.937 s ± 0.493 s 169.488 s ± 0.494 s just-one-mega: 51.289 s ± 0.019 s 5.061 s ± 0.017 s Signed-off-by: Elijah Newren --- diffcore-rename.c | 46 +++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 41 insertions(+), 5 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index b1dda41de9b1..048a6186fd21 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -367,7 +367,6 @@ static int find_exact_renames(struct diff_options *options) return renames; } -MAYBE_UNUSED static int find_basename_matches(struct diff_options *options, int minimum_score, int num_src) @@ -718,12 +717,49 @@ void diffcore_rename(struct diff_options *options) if (minimum_score == MAX_SCORE) goto cleanup; + num_sources = rename_src_nr; + + if (want_copies || break_idx) { + /* + * Cull sources: + * - remove ones corresponding to exact renames + */ + trace2_region_enter("diff", "cull after exact", options->repo); + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull after exact", options->repo); + } else { + /* Determine minimum score to match basenames */ + int min_basename_score = (int)(5*minimum_score + 0*MAX_SCORE)/5; + + /* + * Cull sources: + * - remove ones involved in renames (found via exact match) + */ + trace2_region_enter("diff", "cull exact", options->repo); + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull exact", options->repo); + + /* Utilize file basenames to quickly find renames. */ + trace2_region_enter("diff", "basename matches", options->repo); + rename_count += find_basename_matches(options, + min_basename_score, + rename_src_nr); + trace2_region_leave("diff", "basename matches", options->repo); + + /* + * Cull sources, again: + * - remove ones involved in renames (found via basenames) + */ + trace2_region_enter("diff", "cull basename", options->repo); + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull basename", options->repo); + } + /* - * Calculate how many renames are left + * Calculate how many rename destinations are left */ num_destinations = (rename_dst_nr - rename_count); - remove_unneeded_paths_from_src(want_copies); - num_sources = rename_src_nr; + num_sources = rename_src_nr; /* rename_src_nr reflects lower number */ /* All done? */ if (!num_destinations || !num_sources) @@ -755,7 +791,7 @@ void diffcore_rename(struct diff_options *options) struct diff_score *m; if (rename_dst[i].is_rename) - continue; /* dealt with exact match already. */ + continue; /* exact or basename match already handled */ m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST]; for (j = 0; j < NUM_CANDIDATE_PER_DST; j++) From patchwork Tue Feb 9 11:32:06 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12077853 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5C6DAC433DB for ; Tue, 9 Feb 2021 11:44:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1042364E7C for ; Tue, 9 Feb 2021 11:44:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230280AbhBILnF (ORCPT ); Tue, 9 Feb 2021 06:43:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60466 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230242AbhBILhA (ORCPT ); Tue, 9 Feb 2021 06:37:00 -0500 Received: from mail-wr1-x42d.google.com (mail-wr1-x42d.google.com [IPv6:2a00:1450:4864:20::42d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 38E8BC061356 for ; Tue, 9 Feb 2021 03:32:14 -0800 (PST) Received: by mail-wr1-x42d.google.com with SMTP id h12so4908856wrw.6 for ; Tue, 09 Feb 2021 03:32:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=JFVBRHg6s2ehfRB8eUQwTpqlarz3y3sOyRm0WMS1zsU=; b=fM6NvMRNIk69ebjhe0dtH6HVblU11RrglUeJQa5PzGQClwzUyZgKR10ZoLjPyhQJto GoVlFQwXdrQidzlh/qthsl3L80bckG3Cknp6FQ7GwB6ARilzHQy0ccxF3jmp/sy9JEgu T2boYygpxtCrSFn2LX3NdC+YyZ7NBMg/4YMO/fiaZypKqM0tI/9HKDndwhhqys+NmRQa CwNbw+SShDFD7mSKZvWWa8g9fZ4H7my7x8kkn6bGGpP8irJ7qFt4AFRrpBvVz/o5lknJ ihFUGyPgJMayBpoHO9GkyrLJZpE2+aKhCxQYO0RwSB8/h87BM1tJMs5sh09KvIbk/t4i Wkjg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=JFVBRHg6s2ehfRB8eUQwTpqlarz3y3sOyRm0WMS1zsU=; b=Rk5CD3YJrvnGwle/r0K3RS7uewj731XMO66AyK23hSsw0UkCOjOvu30kLXbTyYBPjK tUsVt5saoa7scYxRMrcBu3NHmS28hV/I4WnB6e9j38CZkjVMHtapfVjWCi3BzIRBHpYG L9yjHynBsbX63x1px8rH8Lpz7k+IShs6anyLWPqtl1FxG+NyJ4JyGhSjM+V3tgsbBRIu /TGQr6X5aO2/D4AJpDyF8wjYXKiOQC72q+AkFLqDjxsuT5hityeFakazf6FvwHQAuqtd 7GdrAFA/3BAIjBvG0SlTLE8yki9VOzt5hG6A5EanF4ng+9snzj1YEA/uiXr3zNScET8x lMWQ== X-Gm-Message-State: AOAM530Y9WrzH6lMJZoXXbqT+kTQWyiz1sqCX5Ucp3bkK9xshz0mv7eJ LWCX+N6YR7hPTIijKN/zTP13w7zk2FQ= X-Google-Smtp-Source: ABdhPJyfLjRWbTFL3ZEa97InmpyPF4mmd64c6Y1mO7+IdlpBnQH7bdEykFYqkw/R5fpxfGt8LsyM0w== X-Received: by 2002:adf:f40d:: with SMTP id g13mr24801163wro.142.1612870332696; Tue, 09 Feb 2021 03:32:12 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id f14sm4071253wmc.32.2021.02.09.03.32.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 09 Feb 2021 03:32:11 -0800 (PST) Message-Id: In-Reply-To: References: Date: Tue, 09 Feb 2021 11:32:06 +0000 Subject: [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren The last few patches have introduced a new preliminary step when rename detection is on but both break detection and copy detection are off. Document this new step. While we're at it, add a testcase that checks the new behavior as well. Signed-off-by: Elijah Newren --- Documentation/gitdiffcore.txt | 15 +++++++++++++++ t/t4001-diff-rename.sh | 24 ++++++++++++++++++++++++ 2 files changed, 39 insertions(+) diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt index c970d9fe438a..954ae3ef1082 100644 --- a/Documentation/gitdiffcore.txt +++ b/Documentation/gitdiffcore.txt @@ -168,6 +168,21 @@ a similarity score different from the default of 50% by giving a number after the "-M" or "-C" option (e.g. "-M8" to tell it to use 8/10 = 80%). +Note that when rename detection is on but both copy and break +detection are off, rename detection adds a preliminary step that first +checks files with the same basename. If files with the same basename +are sufficiently similar, it will mark them as renames and exclude +them from the later quadratic step (the one that pairwise compares all +unmatched files to find the "best" matches, determined by the highest +content similarity). So, for example, if docs/extensions.txt and +docs/config/extensions.txt have similar content, then they will be +marked as a rename even if it turns out that docs/extensions.txt was +more similar to src/extension-checks.c. At most, one comparison is +done per file in this preliminary pass; so if there are several +extensions.txt files throughout the directory hierarchy that were +added and deleted, this preliminary step will be skipped for those +files. + Note. When the "-C" option is used with `--find-copies-harder` option, 'git diff-{asterisk}' commands feed unmodified filepairs to diffcore mechanism as well as modified ones. This lets the copy diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh index c16486a9d41a..bf62537c29a0 100755 --- a/t/t4001-diff-rename.sh +++ b/t/t4001-diff-rename.sh @@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' ' grep "myotherfile.*myfile" actual ' +test_expect_success 'basename similarity vs best similarity' ' + mkdir subdir && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 line10 >subdir/file.txt && + git add subdir/file.txt && + git commit -m "base txt" && + + git rm subdir/file.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 >file.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 >file.md && + git add file.txt file.md && + git commit -a -m "rename" && + git diff-tree -r -M --name-status HEAD^ HEAD >actual && + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, + # but since same basenames are checked first... + cat >expected <<-\EOF && + A file.md + R078 subdir/file.txt file.txt + EOF + test_cmp expected actual +' + test_done