From patchwork Thu Feb 11 08:15:44 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12082467 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DB14EC433E0 for ; Thu, 11 Feb 2021 08:17:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AD8F264DFF for ; Thu, 11 Feb 2021 08:17:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229815AbhBKIQk (ORCPT ); Thu, 11 Feb 2021 03:16:40 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44916 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229771AbhBKIQd (ORCPT ); Thu, 11 Feb 2021 03:16:33 -0500 Received: from mail-wm1-x336.google.com (mail-wm1-x336.google.com [IPv6:2a00:1450:4864:20::336]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0CF27C061756 for ; Thu, 11 Feb 2021 00:15:53 -0800 (PST) Received: by mail-wm1-x336.google.com with SMTP id u14so4586532wmq.4 for ; Thu, 11 Feb 2021 00:15:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=JHMHk+LDzMWyCqxgl66ibfHAEuooiwB7NzbFFSqr88Q=; b=Ju4SIVr6pRHuVJmDs5M7zXToEAZLrv+dQKsACO8rhz5aL5VSx2D8EScA73mmKu8B78 S0dV6MbYL/541KgL5vg8kQ8xZczV6Wgr1r1/gSrMIISPnvEbStn0pHrOnNqEoop6KLUv PAecisprFRS7AiPdGsy+Au2D0OJTdEuga/mQiRKoWNGAsLQSp/vdxQ3m96GBmDkG1oL5 yQfOeq3WaIKZcbJR8UNRT2q+MStnCFbAoAsLg3z6nMdRznYnAmAXK+TAcE2xQ1aOAuYX FypBmjYVzSGnURzb2RNT56CKY09UOERiYNFmf1x8+TTGfi1I3rPIBVNbBw2p5q2pZcTI aikQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=JHMHk+LDzMWyCqxgl66ibfHAEuooiwB7NzbFFSqr88Q=; b=muhSrXF9Tr2AFh2YOHeJa3ledyAjoDbpH1/UmH7AInBBtVx2SVxZmlQCSzDqiAgM+O upnb63lJaTZKBp6RdrjjPlGwHIyZol7JP0mnXTEpVrFvgglQIycY+yaK3n2x7YMzMCj9 Arl2ecEdt2Rv0y4Tsq50/hrasqLvE59GkAas2I5c8vTrZg6sL1YTyny0YylkUB++wGsb bdeS7UNi3kGmSWqq83oIPIquL2i+9ZI3VfynEhxhRcLvyq487uiopxQn/TVpgRD3wq4M imY3T7dHzTH9mReiF0O5JKT3BdmwPjamivZj6oSjftwgMxLlorFzHqCAkqcWzFFaihL/ 2zhg== X-Gm-Message-State: AOAM533bm6FwrAdKsSD/BuE6VyWBQT2YEhxjvjBrDOA2LQ4ot4cAv0Hw MccjQzSTmTklcBqEpXS0mPWs3nWBS68= X-Google-Smtp-Source: ABdhPJwe7VTK9Co7gnp8wkVTjGQ2UBhhgwx+c2omRvRzzyshRoxbirEsm3yTNqBfioB7ydr2qvODxA== X-Received: by 2002:a1c:ed1a:: with SMTP id l26mr3963407wmh.114.1613031351736; Thu, 11 Feb 2021 00:15:51 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id w12sm7955650wmi.4.2021.02.11.00.15.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 11 Feb 2021 00:15:51 -0800 (PST) Message-Id: <3e6af929d135ef2dc239e2f47f92a7e2e91cbd17.1613031350.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Thu, 11 Feb 2021 08:15:44 +0000 Subject: [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren Add a simple test where a removed file is similar to two different added files; one of them has the same basename, and the other has a slightly higher content similarity. Without break detection, filename similarity of 100% trumps content similarity for pairing up related files. For any filename similarity less than 100%, the opposite is true -- content similarity is all that matters. Add a testcase that documents this. Subsequent commits will add a new rule that includes an inbetween state, where a mixture of filename similarity and content similarity are weighed, and which will change the outcome of this testcase. Signed-off-by: Elijah Newren --- t/t4001-diff-rename.sh | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh index c16486a9d41a..797343b38106 100755 --- a/t/t4001-diff-rename.sh +++ b/t/t4001-diff-rename.sh @@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' ' grep "myotherfile.*myfile" actual ' +test_expect_success 'basename similarity vs best similarity' ' + mkdir subdir && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 line10 >subdir/file.txt && + git add subdir/file.txt && + git commit -m "base txt" && + + git rm subdir/file.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 >file.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 >file.md && + git add file.txt file.md && + git commit -a -m "rename" && + git diff-tree -r -M --name-status HEAD^ HEAD >actual && + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, + # but since same basenames are checked first... + cat >expected <<-\EOF && + R088 subdir/file.txt file.md + A file.txt + EOF + test_cmp expected actual +' + test_done From patchwork Thu Feb 11 08:15:45 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12082469 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B5E6CC433DB for ; Thu, 11 Feb 2021 08:17:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 872E564E7D for ; Thu, 11 Feb 2021 08:17:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229837AbhBKIQx (ORCPT ); Thu, 11 Feb 2021 03:16:53 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44922 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229775AbhBKIQe (ORCPT ); Thu, 11 Feb 2021 03:16:34 -0500 Received: from mail-wm1-x32e.google.com (mail-wm1-x32e.google.com [IPv6:2a00:1450:4864:20::32e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AA108C0613D6 for ; Thu, 11 Feb 2021 00:15:53 -0800 (PST) Received: by mail-wm1-x32e.google.com with SMTP id 190so4436967wmz.0 for ; Thu, 11 Feb 2021 00:15:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=qRSc+dm1tQkHMASnWY40JsX73g/hNfM0nZohRVo2Sjo=; b=RK+R/D7bDaAorDxtRAgInsFvqlXSVVcaZfUlqW7ki0+7E2ptYPUfU77r9VPqqu6bXr Zf+k4ObmiD2nJXJP58kx2oM2AYzbqYo3qEiMMZpyBSUzY9Rj7MjOKgx88apV8gy0dmZ6 XpSNnft19HYHD0WB7q6jJ5EZhhgC2ixTR/WjRcrgSG+vGJZi037NIXwxQ1/Ds71BzcbD p0/s3ecthwiyfhE/XekWgNQn6YhAVMcEtr2yq6HxRbZvqq6TPf3C+DlYOPaCII6Cb2hK k+GatgoetoIW/uj7AzLGfIUa/jKDoO2ISaTDinxT+PweR8bXFQuQZJrM866NX2WgR+jC VU1g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=qRSc+dm1tQkHMASnWY40JsX73g/hNfM0nZohRVo2Sjo=; b=nt2ajowukuKNrG5ewjQvoPDv3zjcw+psUu86kfoNDR/4e9ZlthKPRUX+35RA6y/NMm S9SWL7iN+69CXr6i8Us0w0bPiq+FRJ3q/9qNK2foTjhe77WplL4+CVRaH6UpRgS8k2zg uHTE5N6vPSyeGCyTxA+fTBQAbG6Xa9hbH9R1iakjQOTCcc8esOOa83wj4QygPNqqAhlW 0f4MrgW+sI+RF58v4utPcnm0ETL0ZCTWqNzrU9kGkTv0OX+QkJsQuy/VBPdqLWKE35Ic oHrets2xVzCH9iTkfVp0Teswkb0jsPuOpbjvnWJXPyE7sP/bboxPe+b9us+ThSg7T6LZ Ay1A== X-Gm-Message-State: AOAM532OSE7GuN1zOLLlOM84U/6qvrXNh1P/hz0NJrvzMKxqpAz+sCvt YCBeUWVo4gCkMoHmCQhCkCGx4H5w6MQ= X-Google-Smtp-Source: ABdhPJxbYobt8Ww8CRrkPmJ55ND6kg8J1jMnp4RS1wNi6KET/RK7f/Tz4vafxuJi5sjm/xfWNxlv/w== X-Received: by 2002:a05:600c:8a6:: with SMTP id l38mr503458wmp.44.1613031352405; Thu, 11 Feb 2021 00:15:52 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id x10sm895431wmg.6.2021.02.11.00.15.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 11 Feb 2021 00:15:52 -0800 (PST) Message-Id: <4fff9b1ff57b62587b1cbec2031f96199a214702.1613031350.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Thu, 11 Feb 2021 08:15:45 +0000 Subject: [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren We want to make use of unique basenames to help inform rename detection, so that more likely pairings can be checked first. (src/moduleA/foo.txt and source/module/A/foo.txt are likely related if there are no other 'foo.txt' files among the deleted and added files.) Add a new function, not yet used, which creates a map of the unique basenames within rename_src and another within rename_dst, together with the indices within rename_src/rename_dst where those basenames show up. Non-unique basenames still show up in the map, but have an invalid index (-1). This function was inspired by the fact that in real world repositories, files are often moved across directories without changing names. Here are some sample repositories and the percentage of their historical renames (as of early 2020) that preserved basenames: * linux: 76% * gcc: 64% * gecko: 79% * webkit: 89% These statistics alone don't prove that an optimization in this area will help or how much it will help, since there are also unpaired adds and deletes, restrictions on which basenames we consider, etc., but it certainly motivated the idea to try something in this area. Signed-off-by: Elijah Newren --- diffcore-rename.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) diff --git a/diffcore-rename.c b/diffcore-rename.c index 74930716e70d..3eb49a098adf 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -367,6 +367,67 @@ static int find_exact_renames(struct diff_options *options) return renames; } +static const char *get_basename(const char *filename) +{ + /* + * gitbasename() has to worry about special drivers, multiple + * directory separator characters, trailing slashes, NULL or + * empty strings, etc. We only work on filenames as stored in + * git, and thus get to ignore all those complications. + */ + const char *base = strrchr(filename, '/'); + return base ? base + 1 : filename; +} + +MAYBE_UNUSED +static int find_basename_matches(struct diff_options *options, + int minimum_score, + int num_src) +{ + int i; + struct strintmap sources; + struct strintmap dests; + + /* Create maps of basename -> fullname(s) for sources and dests */ + strintmap_init_with_options(&sources, -1, NULL, 0); + strintmap_init_with_options(&dests, -1, NULL, 0); + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; + const char *base; + + /* exact renames removed in remove_unneeded_paths_from_src() */ + assert(!rename_src[i].p->one->rename_used); + + /* Record index within rename_src (i) if basename is unique */ + base = get_basename(filename); + if (strintmap_contains(&sources, base)) + strintmap_set(&sources, base, -1); + else + strintmap_set(&sources, base, i); + } + for (i = 0; i < rename_dst_nr; ++i) { + char *filename = rename_dst[i].p->two->path; + const char *base; + + if (rename_dst[i].is_rename) + continue; /* involved in exact match already. */ + + /* Record index within rename_dst (i) if basename is unique */ + base = get_basename(filename); + if (strintmap_contains(&dests, base)) + strintmap_set(&dests, base, -1); + else + strintmap_set(&dests, base, i); + } + + /* TODO: Make use of basenames source and destination basenames */ + + strintmap_clear(&sources); + strintmap_clear(&dests); + + return 0; +} + #define NUM_CANDIDATE_PER_DST 4 static void record_if_better(struct diff_score m[], struct diff_score *o) { From patchwork Thu Feb 11 08:15:46 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12082471 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C867BC433E6 for ; Thu, 11 Feb 2021 08:17:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9FC2264DFF for ; Thu, 11 Feb 2021 08:17:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229818AbhBKIQ4 (ORCPT ); Thu, 11 Feb 2021 03:16:56 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44928 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229777AbhBKIQf (ORCPT ); Thu, 11 Feb 2021 03:16:35 -0500 Received: from mail-wm1-x32b.google.com (mail-wm1-x32b.google.com [IPv6:2a00:1450:4864:20::32b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 80D41C061574 for ; Thu, 11 Feb 2021 00:15:54 -0800 (PST) Received: by mail-wm1-x32b.google.com with SMTP id u16so779472wmq.1 for ; Thu, 11 Feb 2021 00:15:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=mj5Y7AlVUBbet8Wcga40zNKhfgQmVARKb2P7Xr/sYP4=; b=jOlO7IvYf27+jKS3+Q/VvVz8GAMRYt5qVbiROdBjjkICLDBV6sTBHI2nu4bDefXp4D LjnzDvs8jhP7CKyr1+qgIloK2MhLPd3Iyh6mPWOVIshiNy9kQfeq8CqeiUO7eLMsdOuA 0mETT+NxnaICbidi8pyXG3t31iEjigrAJOe1TYy1XHvF74XnbUrfUKei3VXLIAHAEkce 8uEMx3uwDwZPfK7XIiVeBw4S8PPZWLWOALNl0Qz6EHoZ9jsu2WCamW/Htc9N9NLg+BKi ZqG5bvs2NoPVIWxjgdMydH44v4R2nE5QNejTkjgi85OCaMi4oQqUZu7ytrlGbTYAjzYu xkmw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=mj5Y7AlVUBbet8Wcga40zNKhfgQmVARKb2P7Xr/sYP4=; b=tdn2TkS2aFMYzPqO8V++w8Irwl3AyX0qzspKNUsmbYZSw4MMC/0rUzV3cd6pwXiDU8 R3Jn6lw4TV9megZyEwA9KwYl9Vq8SoUjJMtGA+4Fa6+UV3vK77TAzVvm7EQbDRdXwl5G ycOKKaoIXjyMP+fLAENNu/uLMRmGAh4wcNLzD6U7tkcb/waq8+Oet/NY6HhMqPzzEJxI AJEFxJOY5MfTATZij+93QsJUoOOTSpynGsdCCBTeDGgnrYYJs/6JhMv3zCxeXVcJXXPf frt/AyUtSfJKiwBylZZy2YVlIeeuFJ1CujYl/1cqGke0RUx7Xx/eF1TtQstBTGzxY/1z 6frw== X-Gm-Message-State: AOAM532CrFBmer816lLSX2ZWLZOb9tV5xMkoViAsljRYrCqTLPvfja30 29OFgJQm77pEarfH37Lnj2Ml2x9qh+w= X-Google-Smtp-Source: ABdhPJzYwrUvcU4G2UoEwFwz3a3htXBFelPkHiv7p6XQuDVTidMFpI+R+Qxsw1gkGYzwLP+jJ/jC3Q== X-Received: by 2002:a1c:60c1:: with SMTP id u184mr1942937wmb.22.1613031353206; Thu, 11 Feb 2021 00:15:53 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id g16sm7699303wmi.30.2021.02.11.00.15.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 11 Feb 2021 00:15:52 -0800 (PST) Message-Id: In-Reply-To: References: Date: Thu, 11 Feb 2021 08:15:46 +0000 Subject: [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren It is not uncommon in real world repositories for the majority of file renames to not change the basename of the file; i.e. most "renames" are just a move of files into different directories. We can make use of this to avoid comparing all rename source candidates with all rename destination candidates, by first comparing sources to destinations with the same basenames. If two files with the same basename are sufficiently similar, we record the rename; if not, we include those files in the more exhaustive matrix comparison. This means we are adding a set of preliminary additional comparisons, but for each file we only compare it with at most one other file. For example, if there was a include/media/device.h that was deleted and a src/module/media/device.h that was added, and there were no other device.h files added or deleted between the commits being compared, then these two files would be compared in the preliminary step. This commit does not yet actually employ this new optimization, it merely adds a function which can be used for this purpose. The next commit will do the necessary plumbing to make use of it. Note that this optimization might give us different results than without the optimization, because it's possible that despite files with the same basename being sufficiently similar to be considered a rename, there's an even better match between files without the same basename. I think that is okay for four reasons: (1) it's easy to explain to the users what happened if it does ever occur (or even for them to intuitively figure out), (2) as the next patch will show it provides such a large performance boost that it's worth the tradeoff, and (3) it's somewhat unlikely that despite having unique matching basenames that other files serve as better matches. Reason (4) takes a full paragraph to explain... If the previous three reasons aren't enough, consider what rename detection already does. Break detection is not the default, meaning that if files have the same _fullname_, then they are considered related even if they are 0% similar. In fact, in such a case, we don't even bother comparing the files to see if they are similar let alone comparing them to all other files to see what they are most similar to. Basically, we override content similarity based on sufficient filename similarity. Without the filename similarity (currently implemented as an exact match of filename), we swing the pendulum the opposite direction and say that filename similarity is irrelevant and compare a full N x M matrix of sources and destinations to find out which have the most similar contents. This optimization just adds another form of filename similarity comparison, but augments it with a file content similarity check as well. Basically, if two files have the same basename and are sufficiently similar to be considered a rename, mark them as such without comparing the two to all other rename candidates. Signed-off-by: Elijah Newren --- diffcore-rename.c | 95 +++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 92 insertions(+), 3 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 3eb49a098adf..001645624e71 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -384,10 +384,52 @@ static int find_basename_matches(struct diff_options *options, int minimum_score, int num_src) { - int i; + /* + * When I checked in early 2020, over 76% of file renames in linux + * just moved files to a different directory but kept the same + * basename. gcc did that with over 64% of renames, gecko did it + * with over 79%, and WebKit did it with over 89%. + * + * Therefore we can bypass the normal exhaustive NxM matrix + * comparison of similarities between all potential rename sources + * and destinations by instead using file basename as a hint (i.e. + * the portion of the filename after the last '/'), checking for + * similarity between files with the same basename, and if we find + * a pair that are sufficiently similar, record the rename pair and + * exclude those two from the NxM matrix. + * + * This *might* cause us to find a less than optimal pairing (if + * there is another file that we are even more similar to but has a + * different basename). Given the huge performance advantage + * basename matching provides, and given the frequency with which + * people use the same basename in real world projects, that's a + * trade-off we are willing to accept when doing just rename + * detection. + * + * If someone wants copy detection that implies they are willing to + * spend more cycles to find similarities between files, so it may + * be less likely that this heuristic is wanted. If someone is + * doing break detection, that means they do not want filename + * similarity to imply any form of content similiarity, and thus + * this heuristic would definitely be incompatible. + */ + + int i, renames = 0; struct strintmap sources; struct strintmap dests; + /* + * The prefeteching stuff wants to know if it can skip prefetching + * blobs that are unmodified...and will then do a little extra work + * to verify that the oids are indeed different before prefetching. + * Unmodified blobs are only relevant when doing copy detection; + * when limiting to rename detection, diffcore_rename[_extended]() + * will never be called with unmodified source paths fed to us, so + * the extra work necessary to check if rename_src entries are + * unmodified would be a small waste. + */ + int skip_unmodified = 0; + /* Create maps of basename -> fullname(s) for sources and dests */ strintmap_init_with_options(&sources, -1, NULL, 0); strintmap_init_with_options(&dests, -1, NULL, 0); @@ -420,12 +462,59 @@ static int find_basename_matches(struct diff_options *options, strintmap_set(&dests, base, i); } - /* TODO: Make use of basenames source and destination basenames */ + /* Now look for basename matchups and do similarity estimation */ + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; + const char *base = NULL; + intptr_t src_index; + intptr_t dst_index; + + /* Find out if this basename is unique among sources */ + base = get_basename(filename); + src_index = strintmap_get(&sources, base); + if (src_index == -1) + continue; /* not a unique basename; skip it */ + assert(src_index == i); + + if (strintmap_contains(&dests, base)) { + struct diff_filespec *one, *two; + int score; + + /* Find out if this basename is unique among dests */ + dst_index = strintmap_get(&dests, base); + if (dst_index == -1) + continue; /* not a unique basename; skip it */ + + /* Ignore this dest if already used in a rename */ + if (rename_dst[dst_index].is_rename) + continue; /* already used previously */ + + /* Estimate the similarity */ + one = rename_src[src_index].p->one; + two = rename_dst[dst_index].p->two; + score = estimate_similarity(options->repo, one, two, + minimum_score, skip_unmodified); + + /* If sufficiently similar, record as rename pair */ + if (score < minimum_score) + continue; + record_rename_pair(dst_index, src_index, score); + renames++; + + /* + * Found a rename so don't need text anymore; if we + * didn't find a rename, the filespec_blob would get + * re-used when doing the matrix of comparisons. + */ + diff_free_filespec_blob(one); + diff_free_filespec_blob(two); + } + } strintmap_clear(&sources); strintmap_clear(&dests); - return 0; + return renames; } #define NUM_CANDIDATE_PER_DST 4 From patchwork Thu Feb 11 08:15:47 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12082475 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 024CBC433E0 for ; Thu, 11 Feb 2021 08:17:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BA65F64EAC for ; Thu, 11 Feb 2021 08:17:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229836AbhBKIQ6 (ORCPT ); Thu, 11 Feb 2021 03:16:58 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44930 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229787AbhBKIQf (ORCPT ); Thu, 11 Feb 2021 03:16:35 -0500 Received: from mail-wm1-x334.google.com (mail-wm1-x334.google.com [IPv6:2a00:1450:4864:20::334]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1C8AAC061786 for ; Thu, 11 Feb 2021 00:15:55 -0800 (PST) Received: by mail-wm1-x334.google.com with SMTP id w4so4426272wmi.4 for ; Thu, 11 Feb 2021 00:15:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=mlzyN6EN09cYDXnT25zJbM8VpmYb8hGqkzmMpJxXOqY=; b=A6+DHpwe6NYUiPpSoIrL6hUE6B3WYzVUimeHtigIyvkU5ZneP+kUNVoRHfNtIXArMQ kepB0kW7n9YqZ+Kv+cQLtzVuf8sU5D4JKVV6vdVT74KOAtU+u5GgrVZxaMWsX7fQsfBO OPJDvN/SKGfG7Lj/NygItLvtwiaEbmX4Vc9xro6i+StEKy4x7H1Pw7cLcJzRIMnUn5Yc ls/V81v5Ys9e3ER8lfCwPgEBifAJRmlL2a4FLPLMp6OF5/5NrPTIwH8DtShsmNaKrPUB rCC/yt6DQrrskVUgPyyj2CF3KyvvjsAfZiGvMJS68JM/KzcIahmNfCWplS4/iTEgYcQF jskQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=mlzyN6EN09cYDXnT25zJbM8VpmYb8hGqkzmMpJxXOqY=; b=gvUJz0hWzVojA+auAFiYeiyzAvBAClVp/EVm6B7kM0JexS1dGmpalGiqwfsaD2Lo9E MmFFjKgA5cIHldPkJErgjmSCXJkaH+E1lxQaCrdyDI7YiXmjj5NK297K4f4RoUBcjyXj H6bWzBWp8JLGk97fBb3qZRmYfuq6vXb1b+kOi5uimICmNdz4dXHFuh2h5fnmOxipo2lD eE6piMHqvtT133aM0ZiZK+YdcGJk7yh1tKTW86Oq6ViyWphiuqM0fCBhRtccS7VmMf3f ++vjr9zvn90av89vusxvvaVy2Y4StBGf8Y2uLuJpB8BIL9bb2vclKoVyhriVG41wYkqS 520g== X-Gm-Message-State: AOAM531JXfLtqZsRp1ikAwMcmxfLhKLEGK+9Zs+/Inn5RT35Y7+y/SFY gmhE3qkGhYhfGx1HerbaCj8+4Bud6W8= X-Google-Smtp-Source: ABdhPJznvwxW4988X7Bp6eg8uVH0jOEMxkudY6D9DMHb9Fp/j7ufNGh2bgl5Nxt8AiZRhG/RMA/R4A== X-Received: by 2002:a1c:c242:: with SMTP id s63mr3917924wmf.9.1613031353845; Thu, 11 Feb 2021 00:15:53 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id g62sm8462133wmf.46.2021.02.11.00.15.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 11 Feb 2021 00:15:53 -0800 (PST) Message-Id: <2493f4b2f55d2b602e448db3f09da1ee3a97c81b.1613031350.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Thu, 11 Feb 2021 08:15:47 +0000 Subject: [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren Make use of the new find_basename_matches() function added in the last two patches, to find renames more rapidly in cases where we can match up files based on basenames. As a quick reminder (see the last two commit messages for more details), this means for example that docs/extensions.txt and docs/config/extensions.txt are considered likely renames if there are no 'extensions.txt' files elsewhere among the added and deleted files, and if a similarity check confirms they are similar, then they are marked as a rename without looking for a better similarity match among other files. This is a behavioral change, as covered in more detail in the previous commit message. We do not use this heuristic together with either break or copy detection. The point of break detection is to say that filename similarity does not imply file content similarity, and we only want to know about file content similarity. The point of copy detection is to use more resources to check for additional similarities, while this is an optimization that uses far less resources but which might also result in finding slightly fewer similarities. So the idea behind this optimization goes against both of those features, and will be turned off for both. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 13.815 s ± 0.062 s 13.294 s ± 0.103 s mega-renames: 1799.937 s ± 0.493 s 187.248 s ± 0.882 s just-one-mega: 51.289 s ± 0.019 s 5.557 s ± 0.017 s Signed-off-by: Elijah Newren --- diffcore-rename.c | 54 ++++++++++++++++++++++++++++++++++++++---- t/t4001-diff-rename.sh | 4 ++-- 2 files changed, 51 insertions(+), 7 deletions(-) diff --git a/diffcore-rename.c b/diffcore-rename.c index 001645624e71..df76e475c710 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -379,7 +379,6 @@ static const char *get_basename(const char *filename) return base ? base + 1 : filename; } -MAYBE_UNUSED static int find_basename_matches(struct diff_options *options, int minimum_score, int num_src) @@ -727,12 +726,57 @@ void diffcore_rename(struct diff_options *options) if (minimum_score == MAX_SCORE) goto cleanup; + num_sources = rename_src_nr; + + if (want_copies || break_idx) { + /* + * Cull sources: + * - remove ones corresponding to exact renames + */ + trace2_region_enter("diff", "cull after exact", options->repo); + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull after exact", options->repo); + } else { + /* Determine minimum score to match basenames */ + double factor = 0.5; + char *basename_factor = getenv("GIT_BASENAME_FACTOR"); + int min_basename_score; + + if (basename_factor) + factor = strtol(basename_factor, NULL, 10)/100.0; + assert(factor >= 0.0 && factor <= 1.0); + min_basename_score = minimum_score + + (int)(factor * (MAX_SCORE - minimum_score)); + + /* + * Cull sources: + * - remove ones involved in renames (found via exact match) + */ + trace2_region_enter("diff", "cull after exact", options->repo); + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull after exact", options->repo); + + /* Utilize file basenames to quickly find renames. */ + trace2_region_enter("diff", "basename matches", options->repo); + rename_count += find_basename_matches(options, + min_basename_score, + rename_src_nr); + trace2_region_leave("diff", "basename matches", options->repo); + + /* + * Cull sources, again: + * - remove ones involved in renames (found via basenames) + */ + trace2_region_enter("diff", "cull basename", options->repo); + remove_unneeded_paths_from_src(want_copies); + trace2_region_leave("diff", "cull basename", options->repo); + } + /* - * Calculate how many renames are left + * Calculate how many rename destinations are left */ num_destinations = (rename_dst_nr - rename_count); - remove_unneeded_paths_from_src(want_copies); - num_sources = rename_src_nr; + num_sources = rename_src_nr; /* rename_src_nr reflects lower number */ /* All done? */ if (!num_destinations || !num_sources) @@ -764,7 +808,7 @@ void diffcore_rename(struct diff_options *options) struct diff_score *m; if (rename_dst[i].is_rename) - continue; /* dealt with exact match already. */ + continue; /* exact or basename match already handled */ m = &mx[dst_cnt * NUM_CANDIDATE_PER_DST]; for (j = 0; j < NUM_CANDIDATE_PER_DST; j++) diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh index 797343b38106..bf62537c29a0 100755 --- a/t/t4001-diff-rename.sh +++ b/t/t4001-diff-rename.sh @@ -280,8 +280,8 @@ test_expect_success 'basename similarity vs best similarity' ' # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, # but since same basenames are checked first... cat >expected <<-\EOF && - R088 subdir/file.txt file.md - A file.txt + A file.md + R078 subdir/file.txt file.txt EOF test_cmp expected actual ' From patchwork Thu Feb 11 08:15:48 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12082473 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 10B2BC433E9 for ; Thu, 11 Feb 2021 08:17:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DD8A564E9C for ; Thu, 11 Feb 2021 08:17:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229850AbhBKIQ7 (ORCPT ); Thu, 11 Feb 2021 03:16:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44936 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229793AbhBKIQg (ORCPT ); Thu, 11 Feb 2021 03:16:36 -0500 Received: from mail-wm1-x32f.google.com (mail-wm1-x32f.google.com [IPv6:2a00:1450:4864:20::32f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C105AC061788 for ; Thu, 11 Feb 2021 00:15:55 -0800 (PST) Received: by mail-wm1-x32f.google.com with SMTP id j11so4420985wmi.3 for ; Thu, 11 Feb 2021 00:15:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:fcc :content-transfer-encoding:mime-version:to:cc; bh=0JDczPChfdRdhwca7LvDeEZ24EDYjAvYQgunfeHMrI0=; b=d3I06eaY2zrXe/F8gShiMjIvul6xJ0myoUxjmNR1NPDvWDTVh+2NhW7zlwsbdiTJiL B39CAgXZUStRxKzFHTuPhHRefLXh7hwZ8Vg3iOLWJCPoIdyEbumyTXIhvAeL+zBu3c7v 5at/JIZUKG66b9qI1HsshImFwDZNKISh1EVtvIts3mL2k3TM+S1EUCKqOyKW1tP8JYC+ Pc3bOz+FnmZvJNzjLocb6VRB/HYNNPmHTSOG75MYoUPbcXRvfntGC5q0Rcw89ITuIIDD T4ablHBd2NRRAW/hB7l9hzew8AuLUnDDbb6Z/EbxR3ER3lo5gvAGEuwGElvGOpMk5txt BhgA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:fcc:content-transfer-encoding:mime-version:to:cc; bh=0JDczPChfdRdhwca7LvDeEZ24EDYjAvYQgunfeHMrI0=; b=l/pefGBRYNPf8lhdn9pzc8t72z1eQTla+dnOkKqjPqiBjLox8/patEzZ7TRcb5w+aR 6oUZD7M9/LQH29LgZo/3GD6e58CM3LqxCdqMNARqLPYHhxm0Cn7XbQZHc8AdQ3WROkko l/gR5d+MAZ5VazfuJvuMIRuMyyCXu4dqeeVT32z9toTjz+AmH3het0CXVWCbKrLqpGFq m4KzIbEV7LKjMZl1u3ni9JIqWsIAx4nRNwczw5I6U/FkxYKvKluvu1eRcoVwGDny2fwY vYSyL8jMXpqdAH2Lj8U/G4nYNQ5HuNlhff6fQB96XZMjjRRMZfzWvMFvh5uhl0AMDh0O ltOA== X-Gm-Message-State: AOAM532LJ00Mhppc6zIZaMJjvdxPkShTh7iKP2C2f4pApY4XEaTMYGtK T/6tVxDEpzoFIpw/KDbJ2Z5O1Nh0sr4= X-Google-Smtp-Source: ABdhPJyJxxOfG+SkjBQp3XzZIrvge+HJ10VVO2vcK/T8shqjlk+EFgDjCl69owteB9CvvuctFjEfzg== X-Received: by 2002:a05:600c:2246:: with SMTP id a6mr3821633wmm.170.1613031354550; Thu, 11 Feb 2021 00:15:54 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id t17sm8107607wmi.20.2021.02.11.00.15.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 11 Feb 2021 00:15:54 -0800 (PST) Message-Id: <4e86ed3f29d46add81d5a17046e2c6a574a2fff3.1613031350.git.gitgitgadget@gmail.com> In-Reply-To: References: Date: Thu, 11 Feb 2021 08:15:48 +0000 Subject: [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Fcc: Sent MIME-Version: 1.0 To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren The last few patches have introduced a new preliminary step when rename detection is on but both break detection and copy detection are off. Document this new step. While we're at it, add a testcase that checks the new behavior as well. Signed-off-by: Elijah Newren --- Documentation/gitdiffcore.txt | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt index c970d9fe438a..edf92d988c8f 100644 --- a/Documentation/gitdiffcore.txt +++ b/Documentation/gitdiffcore.txt @@ -168,6 +168,26 @@ a similarity score different from the default of 50% by giving a number after the "-M" or "-C" option (e.g. "-M8" to tell it to use 8/10 = 80%). +Note that when rename detection is on but both copy and break +detection are off, rename detection adds a preliminary step that first +checks if files are moved across directories while keeping their +filename the same. If there is a file added to a directory whose +contents is sufficiently similar to a file with the same name that got +deleted from a different directory, it will mark them as renames and +exclude them from the later quadratic step (the one that pairwise +compares all unmatched files to find the "best" matches, determined by +the highest content similarity). So, for example, if a deleted +docs/ext.txt and an added docs/config/ext.txt are similar enough, they +will be marked as a rename and prevent an added docs/ext.md that may +be even more similar to the deleted docs/ext.txt from being considered +as the rename destination in the later step. For this reason, the +preliminary "match same filename" step uses a bit higher threshold to +mark a file pair as a rename and stop considering other candidates for +better matches. At most, one comparison is done per file in this +preliminary pass; so if there are several ext.txt files throughout the +directory hierarchy that were added and deleted, this preliminary step +will be skipped for those files. + Note. When the "-C" option is used with `--find-copies-harder` option, 'git diff-{asterisk}' commands feed unmodified filepairs to diffcore mechanism as well as modified ones. This lets the copy From patchwork Thu Feb 11 08:15:49 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Elijah Newren X-Patchwork-Id: 12082477 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.7 required=3.0 tests=BAYES_00,BIGNUM_EMAILS, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DF780C433E0 for ; Thu, 11 Feb 2021 08:17:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A5C6864E7D for ; Thu, 11 Feb 2021 08:17:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229626AbhBKIRZ (ORCPT ); Thu, 11 Feb 2021 03:17:25 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45074 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229733AbhBKIRO (ORCPT ); Thu, 11 Feb 2021 03:17:14 -0500 Received: from mail-wm1-x32a.google.com (mail-wm1-x32a.google.com [IPv6:2a00:1450:4864:20::32a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6750EC06178A for ; Thu, 11 Feb 2021 00:15:56 -0800 (PST) Received: by mail-wm1-x32a.google.com with SMTP id w4so4426328wmi.4 for ; Thu, 11 Feb 2021 00:15:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:in-reply-to:references:from:date:subject:mime-version :content-transfer-encoding:fcc:to:cc; bh=MeEArXXMflp+URAUCekXIHJ/+BfExcgxrDq0SnIySvs=; b=IMUURQaQzX6+IyoWcNzAMY7MvNi70hyYxe0Od/eA0TWBAMPOzV1GmoEURZSnvJ4tzb iabx9UPf3Ydf/fIN6Qxcmxrw3LIsQgwWj0ZBH+GY0I0qREQsY0DlJiOGq5Wpk0+IU1Cq OHWlYNBmU9PYewasTJVKWHhWZTb7GPytbRsvq8IChNe5rImjOT5weAwVRieESsgNeWl8 gpNq6RyX4RZ86qcOKKxvD4gCRI1+pSEuhR6FCBnKAxMF697QWnFwUx/paTWFZjb3U6iL mxfFURNhdOv0TTJ8sbK+DpUU8mMEipNOzAFI0w9cIePLPloB4f0lfhrLyx9fDNRDA2Nt gkSQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:in-reply-to:references:from:date :subject:mime-version:content-transfer-encoding:fcc:to:cc; bh=MeEArXXMflp+URAUCekXIHJ/+BfExcgxrDq0SnIySvs=; b=OG4LKykMg9jM03CcBEe1zjEgYHQxaUU1sMh4L4acjXQOk02X3iSTSkQ6SHFzgQr14o sNtCQv5bseVpsdVX6ZY0pJeYk3ecRcFzwyscjExtP92opCQU7QKY+WSikBSnsg3HEWPy FmcfR5OT+FDKdXhjAX4PFuynMhQv2NhG6EnV/3QXBXteey+Kn5j4TJJApujwmne47Hj1 jVN1nzEAFvXnwWYq4JMEfvAtn+LCL3ki7v6HcPNzqA9FOYovdfmxfF9AWxeEyOSJeK47 H9xrx6UCBlvBaa6JqsvwQ9xx0JkTQOrwysuM7UVAC+51NDuawjvJSN8WTdrmOMtoywh4 RJ8g== X-Gm-Message-State: AOAM531wXOH6fJnuE5iGkveOxrZRC/jOFGtX8CjMz0X9B4+mV63VsDzK xsTfdiGzhccRmRrxsHw9yBxlZbx6eyw= X-Google-Smtp-Source: ABdhPJxkvP7jijjCZR6eqYKpqsnn6JSeBhLlrdCfKyUVAgpEXFzFB7WBKsUNHZX0xbFlIKIhiQsCjg== X-Received: by 2002:a1c:29c6:: with SMTP id p189mr3932907wmp.110.1613031355186; Thu, 11 Feb 2021 00:15:55 -0800 (PST) Received: from [127.0.0.1] ([13.74.141.28]) by smtp.gmail.com with ESMTPSA id q19sm8570135wmj.23.2021.02.11.00.15.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 11 Feb 2021 00:15:54 -0800 (PST) Message-Id: In-Reply-To: References: Date: Thu, 11 Feb 2021 08:15:49 +0000 Subject: [PATCH v4 6/6] merge-ort: call diffcore_rename() directly MIME-Version: 1.0 Fcc: Sent To: git@vger.kernel.org Cc: Derrick Stolee , Jonathan Tan , Taylor Blau , Junio C Hamano , Jeff King , Elijah Newren , Derrick Stolee , Elijah Newren , Elijah Newren Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org From: Elijah Newren From: Elijah Newren We want to pass additional information to diffcore_rename() (or some variant thereof) without plumbing that extra information through diff_tree_oid() and diffcore_std(). Further, since we will need to gather additional special information related to diffs and are walking the trees anyway in collect_merge_info(), it seems odd to have diff_tree_oid()/diffcore_std() repeat those tree walks. And there may be times where we can avoid traversing into a subtree in collect_merge_info() (based on additional information at our disposal), that the basic diff logic would be unable to take advantage of. For all these reasons, just create the add and delete pairs ourself and then call diffcore_rename() directly. This change is primarily about enabling future optimizations; the advantage of avoiding extra tree traversals is small compared to the cost of rename detection, and the advantage of avoiding the extra tree traversals is somewhat offset by the extra time spent in collect_merge_info() collecting the additional data anyway. However... For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 13.294 s ± 0.103 s 12.775 s ± 0.062 s mega-renames: 187.248 s ± 0.882 s 188.754 s ± 0.284 s just-one-mega: 5.557 s ± 0.017 s 5.599 s ± 0.019 s Signed-off-by: Elijah Newren --- merge-ort.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 59 insertions(+), 7 deletions(-) diff --git a/merge-ort.c b/merge-ort.c index 931b91438cf1..603d30c52170 100644 --- a/merge-ort.c +++ b/merge-ort.c @@ -535,6 +535,23 @@ static void setup_path_info(struct merge_options *opt, result->util = mi; } +static void add_pair(struct merge_options *opt, + struct name_entry *names, + const char *pathname, + unsigned side, + unsigned is_add /* if false, is_delete */) +{ + struct diff_filespec *one, *two; + struct rename_info *renames = &opt->priv->renames; + int names_idx = is_add ? side : 0; + + one = alloc_filespec(pathname); + two = alloc_filespec(pathname); + fill_filespec(is_add ? two : one, + &names[names_idx].oid, 1, names[names_idx].mode); + diff_queue(&renames->pairs[side], one, two); +} + static void collect_rename_info(struct merge_options *opt, struct name_entry *names, const char *dirname, @@ -544,6 +561,7 @@ static void collect_rename_info(struct merge_options *opt, unsigned match_mask) { struct rename_info *renames = &opt->priv->renames; + unsigned side; /* Update dirs_removed, as needed */ if (dirmask == 1 || dirmask == 3 || dirmask == 5) { @@ -554,6 +572,21 @@ static void collect_rename_info(struct merge_options *opt, if (sides & 2) strset_add(&renames->dirs_removed[2], fullname); } + + if (filemask == 0 || filemask == 7) + return; + + for (side = MERGE_SIDE1; side <= MERGE_SIDE2; ++side) { + unsigned side_mask = (1 << side); + + /* Check for deletion on side */ + if ((filemask & 1) && !(filemask & side_mask)) + add_pair(opt, names, fullname, side, 0 /* delete */); + + /* Check for addition on side */ + if (!(filemask & 1) && (filemask & side_mask)) + add_pair(opt, names, fullname, side, 1 /* add */); + } } static int collect_merge_info_callback(int n, @@ -2079,6 +2112,27 @@ static int process_renames(struct merge_options *opt, return clean_merge; } +static void resolve_diffpair_statuses(struct diff_queue_struct *q) +{ + /* + * A simplified version of diff_resolve_rename_copy(); would probably + * just use that function but it's static... + */ + int i; + struct diff_filepair *p; + + for (i = 0; i < q->nr; ++i) { + p = q->queue[i]; + p->status = 0; /* undecided */ + if (!DIFF_FILE_VALID(p->one)) + p->status = DIFF_STATUS_ADDED; + else if (!DIFF_FILE_VALID(p->two)) + p->status = DIFF_STATUS_DELETED; + else if (DIFF_PAIR_RENAME(p)) + p->status = DIFF_STATUS_RENAMED; + } +} + static int compare_pairs(const void *a_, const void *b_) { const struct diff_filepair *a = *((const struct diff_filepair **)a_); @@ -2089,8 +2143,6 @@ static int compare_pairs(const void *a_, const void *b_) /* Call diffcore_rename() to compute which files have changed on given side */ static void detect_regular_renames(struct merge_options *opt, - struct tree *merge_base, - struct tree *side, unsigned side_index) { struct diff_options diff_opts; @@ -2108,11 +2160,11 @@ static void detect_regular_renames(struct merge_options *opt, diff_opts.output_format = DIFF_FORMAT_NO_OUTPUT; diff_setup_done(&diff_opts); + diff_queued_diff = renames->pairs[side_index]; trace2_region_enter("diff", "diffcore_rename", opt->repo); - diff_tree_oid(&merge_base->object.oid, &side->object.oid, "", - &diff_opts); - diffcore_std(&diff_opts); + diffcore_rename(&diff_opts); trace2_region_leave("diff", "diffcore_rename", opt->repo); + resolve_diffpair_statuses(&diff_queued_diff); if (diff_opts.needed_rename_limit > renames->needed_limit) renames->needed_limit = diff_opts.needed_rename_limit; @@ -2212,8 +2264,8 @@ static int detect_and_process_renames(struct merge_options *opt, memset(&combined, 0, sizeof(combined)); trace2_region_enter("merge", "regular renames", opt->repo); - detect_regular_renames(opt, merge_base, side1, MERGE_SIDE1); - detect_regular_renames(opt, merge_base, side2, MERGE_SIDE2); + detect_regular_renames(opt, MERGE_SIDE1); + detect_regular_renames(opt, MERGE_SIDE2); trace2_region_leave("merge", "regular renames", opt->repo); trace2_region_enter("merge", "directory renames", opt->repo);