From patchwork Mon Jun 10 15:30:13 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Barret Rhoden X-Patchwork-Id: 10985089 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 033AC13AF for ; Mon, 10 Jun 2019 15:30:49 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E7D9D287C0 for ; Mon, 10 Jun 2019 15:30:48 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id DC8F8285C9; Mon, 10 Jun 2019 15:30:48 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.5 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI, USER_IN_DEF_DKIM_WL autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5EBD7285C3 for ; Mon, 10 Jun 2019 15:30:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2391228AbfFJPar (ORCPT ); Mon, 10 Jun 2019 11:30:47 -0400 Received: from mail-qk1-f201.google.com ([209.85.222.201]:44329 "EHLO mail-qk1-f201.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2391204AbfFJPaq (ORCPT ); Mon, 10 Jun 2019 11:30:46 -0400 Received: by mail-qk1-f201.google.com with SMTP id c207so5062015qkb.11 for ; Mon, 10 Jun 2019 08:30:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=5zKAN9m9KIwISLLUxQ5SZ9B/gPEYRfPMBGleQOfHUzQ=; b=ou8Dhl0YLdfL8zJeeymGITPnXpovn91vp6XEmRaZ1wNPSTireMp2UPkDNMwF8HTmgC 7WxYFznnzBEs0Lrr0wizSXmkCaUmwzWXzj2EIMB48TYXSL+VzAyx+ji6K8afHaOD9OkC +dnSk4pFbsPs5bq1CySfKy0zYrPr6JJuayD8M/AxssHE2X0M3DM3U23v1VORwwaINKwR ToW79yWidj0fwidyagwtS9cLKTaI5vhhEEYwWuz3aIsNKSa7NJIbbKxYyTyN1/Arkft/ 8z6I2lbCpOGKKuHTaNt7eAUIacfgQ2nH9N9Sx6lcwHy0kdsl9pVlV3IJ9xpFALG31MFt eeNg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=5zKAN9m9KIwISLLUxQ5SZ9B/gPEYRfPMBGleQOfHUzQ=; b=ooueTXPYvs03OWvmAAn3rtOKCJKXiQ4vWFne4+9x62vCilNmUFk6jEORx7/M7WpN6A y0sCx50TQu8SbhS77oevbYF6WkVOGkcbaH9DcDfm3cL7PH2KcbwAIAnS/i0Lrn2OGn04 9nHpOBj7HuGib8oiK7SQMwLpU63wZzORz49fXXhSO+cYtcnclBpkZy8dcRYOXpLx/HpO 6vRjCDA53PhLkSR0pVOJzHnzZ9SyPusxYHJ2S7Hv9vSpa++yjvRYHQMpSy/X/mTDxP72 v1TvYXPekv1OuMS1d0PYsWvotfxeFwDcre9lWsVzQ8g2rq1WzmS/z5qyRoL+LW4Yo9uD TxdQ== X-Gm-Message-State: APjAAAVHPXfizOJzkAwpf81H8kPHuAd6T0ow1blF/fRqsgmGLty+hDXH b/56w8KD5l0n20cJEPntyovpBdlehvDerJLVbAmJCyu+nW8COSJwmY9HGJEhD1u0D4SV+GQOyEt LqkLOfy9IJ8xJQY6IqDWF8d9e/D4643ZTY3qZT4qzEKV2A8Zs7vcu X-Google-Smtp-Source: APXvYqxnKWTs6d7x0vIo0J2PCfoJgrcdGh/oyCkEEPEpQgi/BAS3M5OoTv4xvyHHWt28xl5XInkdM7od X-Received: by 2002:a0c:94d6:: with SMTP id k22mr5725566qvk.58.1560180645147; Mon, 10 Jun 2019 08:30:45 -0700 (PDT) Date: Mon, 10 Jun 2019 11:30:13 -0400 In-Reply-To: <20190610153014.42055-1-brho@google.com> Message-Id: <20190610153014.42055-9-brho@google.com> Mime-Version: 1.0 References: <20190610153014.42055-1-brho@google.com> X-Mailer: git-send-email 2.22.0.rc2.383.gf4fbbf30c2-goog Subject: [PATCH v8 8/9] blame: use the fingerprint heuristic to match ignored lines From: Barret Rhoden To: git@vger.kernel.org Cc: " =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= " , David Kastrup , Jeff King , Jeff Smith , Johannes Schindelin , Junio C Hamano , " =?utf-8?q?Ren=C3=A9_Scharfe?= " , Stefan Beller , Michael Platings Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP This commit integrates the fuzzy fingerprint heuristic into guess_line_blames(). We actually make two passes. The first pass uses the fuzzy algorithm to find a match within the current diff chunk. If that fails, the second pass searches the entire parent file for the best match. For an example of scanning the entire parent for a match, consider: commit-a 30) #include commit-b 31) #include commit-c 32) #include Then commit X alphabetizes them: commit-X 30) #include commit-X 31) #include commit-X 32) #include If we just check the parent's chunk (i.e. the first pass), we'd get: commit-b 30) #include commit-c 31) #include commit-X 32) #include That's because commit X actually consists of two chunks: one chunk is removing sys/header_a.h, then some context, and the second chunk is adding sys/header_a.h. If we scan the entire parent file, we get: commit-b 30) #include commit-c 31) #include commit-a 32) #include Signed-off-by: Barret Rhoden --- blame.c | 60 ++++++++++++++++++++++++++++++++--- t/t8014-blame-ignore-fuzzy.sh | 3 -- 2 files changed, 55 insertions(+), 8 deletions(-) diff --git a/blame.c b/blame.c index 103838546e07..f81ec9a8cf80 100644 --- a/blame.c +++ b/blame.c @@ -990,12 +990,19 @@ static void fill_origin_fingerprints(struct blame_origin *o, mmfile_t *file) return; o->num_lines = find_line_starts(&line_starts, o->file.ptr, o->file.size); - /* TODO: Will fill in fingerprints in a future commit */ + o->fingerprints = xcalloc(sizeof(struct fingerprint), o->num_lines); + get_line_fingerprints(o->fingerprints, o->file.ptr, line_starts, + 0, o->num_lines); free(line_starts); } static void drop_origin_fingerprints(struct blame_origin *o) { + if (o->fingerprints) { + free_line_fingerprints(o->fingerprints, o->num_lines); + o->num_lines = 0; + FREE_AND_NULL(o->fingerprints); + } } /* @@ -1573,9 +1580,34 @@ static int are_lines_adjacent(struct blame_line_tracker *first, first->s_lno + 1 == second->s_lno; } +static int scan_parent_range(struct fingerprint *p_fps, + struct fingerprint *t_fps, int t_idx, + int from, int nr_lines) +{ + int sim, p_idx; + #define FINGERPRINT_FILE_THRESHOLD 10 + int best_sim_val = FINGERPRINT_FILE_THRESHOLD; + int best_sim_idx = -1; + + for (p_idx = from; p_idx < from + nr_lines; p_idx++) { + sim = fingerprint_similarity(&t_fps[t_idx], &p_fps[p_idx]); + if (sim < best_sim_val) + continue; + /* Break ties with the closest-to-target line number */ + if (sim == best_sim_val && best_sim_idx != -1 && + abs(best_sim_idx - t_idx) < abs(p_idx - t_idx)) + continue; + best_sim_val = sim; + best_sim_idx = p_idx; + } + return best_sim_idx; +} + /* - * This cheap heuristic assigns lines in the chunk to their relative location in - * the parent's chunk. Any additional lines are left with the target. + * The first pass checks the blame entry (from the target) against the parent's + * diff chunk. If that fails for a line, the second pass tries to match that + * line to any part of parent file. That catches cases where a change was + * broken into two chunks by 'context.' */ static void guess_line_blames(struct blame_origin *parent, struct blame_origin *target, @@ -1584,11 +1616,22 @@ static void guess_line_blames(struct blame_origin *parent, { int i, best_idx, target_idx; int parent_slno = tlno + offset; + int *fuzzy_matches; + fuzzy_matches = fuzzy_find_matching_lines(parent, target, + tlno, parent_slno, same, + parent_len); for (i = 0; i < same - tlno; i++) { target_idx = tlno + i; - best_idx = target_idx + offset; - if (best_idx < parent_slno + parent_len) { + if (fuzzy_matches && fuzzy_matches[i] >= 0) { + best_idx = fuzzy_matches[i]; + } else { + best_idx = scan_parent_range(parent->fingerprints, + target->fingerprints, + target_idx, 0, + parent->num_lines); + } + if (best_idx >= 0) { line_blames[i].is_parent = 1; line_blames[i].s_lno = best_idx; } else { @@ -1596,6 +1639,7 @@ static void guess_line_blames(struct blame_origin *parent, line_blames[i].s_lno = target_idx; } } + free(fuzzy_matches); } /* @@ -2372,6 +2416,12 @@ static void pass_blame(struct blame_scoreboard *sb, struct blame_origin *origin, if (!porigin) continue; pass_blame_to_parent(sb, origin, porigin, 1); + /* + * Preemptively drop porigin so we can refresh the + * fingerprints if we use the parent again, which can + * occur if you ignore back-to-back commits. + */ + drop_origin_blob(porigin); if (!origin->suspects) goto finish; } diff --git a/t/t8014-blame-ignore-fuzzy.sh b/t/t8014-blame-ignore-fuzzy.sh index 1d8fa1da74c9..1ff59624e915 100755 --- a/t/t8014-blame-ignore-fuzzy.sh +++ b/t/t8014-blame-ignore-fuzzy.sh @@ -3,9 +3,6 @@ test_description='git blame ignore fuzzy heuristic' . ./test-lib.sh -# short circuit until blame has the fuzzy capabilities -test_done - pick_author='s/^[0-9a-f^]* *(\([^ ]*\) .*/\1/' # Each test is composed of 4 variables: