From patchwork Thu Apr 27 08:14:06 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 13225247 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8A459C77B73 for ; Thu, 27 Apr 2023 08:14:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242970AbjD0IO1 (ORCPT ); Thu, 27 Apr 2023 04:14:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38722 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242966AbjD0IOZ (ORCPT ); Thu, 27 Apr 2023 04:14:25 -0400 Received: from cloud.peff.net (cloud.peff.net [104.130.231.41]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BB68549E7 for ; Thu, 27 Apr 2023 01:14:07 -0700 (PDT) Received: (qmail 20674 invoked by uid 109); 27 Apr 2023 08:14:07 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 27 Apr 2023 08:14:07 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 17249 invoked by uid 111); 27 Apr 2023 08:14:06 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 27 Apr 2023 04:14:06 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 27 Apr 2023 04:14:06 -0400 From: Jeff King To: Junio C Hamano Cc: Phillip Wood , Thomas Bock , Derrick Stolee , git@vger.kernel.org, =?utf-8?b?UmVuw6k=?= Scharfe Subject: [PATCH v3 1/4] t4212: avoid putting git on left-hand side of pipe Message-ID: <20230427081406.GA1477912@coredump.intra.peff.net> References: <20230427081330.GA1461786@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20230427081330.GA1461786@coredump.intra.peff.net> Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org We wouldn't expect cat-file to fail here, but it's good practice to avoid putting git on the upstream of a pipe, as we otherwise ignore its exit code. Signed-off-by: Jeff King --- t/t4212-log-corrupt.sh | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index e89e1f54b6..8b5433ea74 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -8,8 +8,9 @@ TEST_PASSES_SANITIZE_LEAK=true test_expect_success 'setup' ' test_commit foo && - git cat-file commit HEAD | - sed "/^author /s/>/>-<>/" >broken_email.commit && + git cat-file commit HEAD >ok.commit && + sed "/^author /s/>/>-<>/" broken_email.commit && + git hash-object --literally -w -t commit broken_email.commit >broken_email.hash && git update-ref refs/heads/broken_email $(cat broken_email.hash) ' From patchwork Thu Apr 27 08:14:09 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 13225248 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0CE8DC77B61 for ; Thu, 27 Apr 2023 08:14:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242586AbjD0IO2 (ORCPT ); Thu, 27 Apr 2023 04:14:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38754 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242985AbjD0IO0 (ORCPT ); Thu, 27 Apr 2023 04:14:26 -0400 Received: from cloud.peff.net (cloud.peff.net [104.130.231.41]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9615144A9 for ; Thu, 27 Apr 2023 01:14:10 -0700 (PDT) Received: (qmail 20692 invoked by uid 109); 27 Apr 2023 08:14:09 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 27 Apr 2023 08:14:09 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 17260 invoked by uid 111); 27 Apr 2023 08:14:09 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 27 Apr 2023 04:14:09 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 27 Apr 2023 04:14:09 -0400 From: Jeff King To: Junio C Hamano Cc: Phillip Wood , Thomas Bock , Derrick Stolee , git@vger.kernel.org, =?utf-8?b?UmVuw6k=?= Scharfe Subject: [PATCH v3 2/4] parse_commit(): parse timestamp from end of line Message-ID: <20230427081409.GB1477912@coredump.intra.peff.net> References: <20230427081330.GA1461786@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20230427081330.GA1461786@coredump.intra.peff.net> Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org To find the committer timestamp, we parse left-to-right looking for the closing ">" of the email, and then expect the timestamp right after that. But we've seen some broken cases in the wild where this fails, but we _could_ find the timestamp with a little extra work. E.g.: Name > 123456789 -0500 This means that features that rely on the committer timestamp, like --since or --until, will treat the commit as happening at time 0 (i.e., 1970). This is doubly confusing because the pretty-print parser learned to handle these in 03818a4a94 (split_ident: parse timestamp from end of line, 2013-10-14). So printing them via "git show", etc, makes everything look normal, but --until, etc are still broken (despite the fact that that commit explicitly mentioned --until!). So let's use the same trick as 03818a4a94: find the end of the line, and parse back to the final ">". In theory we could use split_ident_line() here, but it's actually a bit more strict. In particular, it requires a valid time-zone token, too. That should be present, of course, but we wouldn't want to break --until for cases that are working currently. We might want to teach split_ident_line() to become more lenient there, but it would require checking its many callers (since right now they can assume that if date_start is non-NULL, so is tz_start). So for now we'll just reimplement the same trick in the commit parser. The test is in t4212, which already covers similar cases, courtesy of 03818a4a94. We'll just adjust the broken commit to munge both the author and committer timestamps. Note that we could match (author|committer) here, but alternation can't be used portably in sed. Since we wouldn't expect to see ">" except as part of an ident line, we can just match that character on any line. Signed-off-by: Jeff King --- commit.c | 24 ++++++++++++++++-------- t/t4212-log-corrupt.sh | 7 ++++++- 2 files changed, 22 insertions(+), 9 deletions(-) diff --git a/commit.c b/commit.c index 878b4473e4..04c20d9cc6 100644 --- a/commit.c +++ b/commit.c @@ -96,6 +96,7 @@ struct commit *lookup_commit_reference_by_name(const char *name) static timestamp_t parse_commit_date(const char *buf, const char *tail) { const char *dateptr; + const char *eol; if (buf + 6 >= tail) return 0; @@ -107,16 +108,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) return 0; if (memcmp(buf, "committer", 9)) return 0; - while (buf < tail && *buf++ != '>') - /* nada */; - if (buf >= tail) + + /* + * Jump to end-of-line so that we can walk backwards to find the + * end-of-email ">". This is more forgiving of malformed cases + * because unexpected characters tend to be in the name and email + * fields. + */ + eol = memchr(buf, '\n', tail - buf); + if (!eol) return 0; - dateptr = buf; - while (buf < tail && *buf++ != '\n') - /* nada */; - if (buf >= tail) + dateptr = eol; + while (dateptr > buf && dateptr[-1] != '>') + dateptr--; + if (dateptr == buf || dateptr == eol) return 0; - /* dateptr < buf && buf[-1] == '\n', so parsing will stop at buf-1 */ + + /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ return parse_timestamp(dateptr, NULL, 10); } diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index 8b5433ea74..af4b35ff56 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -9,7 +9,7 @@ test_expect_success 'setup' ' test_commit foo && git cat-file commit HEAD >ok.commit && - sed "/^author /s/>/>-<>/" broken_email.commit && + sed "s/>/>-<>/" broken_email.commit && git hash-object --literally -w -t commit broken_email.commit >broken_email.hash && git update-ref refs/heads/broken_email $(cat broken_email.hash) @@ -44,6 +44,11 @@ test_expect_success 'git log --format with broken author email' ' test_must_be_empty actual.err ' +test_expect_success '--until handles broken email' ' + git rev-list --until=1980-01-01 broken_email >actual && + test_must_be_empty actual +' + munge_author_date () { git cat-file commit "$1" >commit.orig && sed "s/^\(author .*>\) [0-9]*/\1 $2/" commit.munge && From patchwork Thu Apr 27 08:17:15 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 13225249 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6EDCFC77B73 for ; Thu, 27 Apr 2023 08:17:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242966AbjD0IRT (ORCPT ); Thu, 27 Apr 2023 04:17:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40332 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239395AbjD0IRS (ORCPT ); Thu, 27 Apr 2023 04:17:18 -0400 Received: from cloud.peff.net (cloud.peff.net [104.130.231.41]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2AD014209 for ; Thu, 27 Apr 2023 01:17:17 -0700 (PDT) Received: (qmail 20711 invoked by uid 109); 27 Apr 2023 08:17:16 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 27 Apr 2023 08:17:16 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 17294 invoked by uid 111); 27 Apr 2023 08:17:16 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 27 Apr 2023 04:17:16 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 27 Apr 2023 04:17:15 -0400 From: Jeff King To: Junio C Hamano Cc: Phillip Wood , Thomas Bock , Derrick Stolee , git@vger.kernel.org, =?utf-8?b?UmVuw6k=?= Scharfe Subject: [PATCH v3 3/4] parse_commit(): handle broken whitespace-only timestamp Message-ID: <20230427081715.GA1478467@coredump.intra.peff.net> References: <20230427081330.GA1461786@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20230427081330.GA1461786@coredump.intra.peff.net> Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org The comment in parse_commit_date() claims that parse_timestamp() will not walk past the end of the buffer we've been given, since it will hit the newline at "eol" and stop. This is usually true, when dateptr contains actual numbers to parse. But with a line like: committer name \n with just whitespace, and no numbers, parse_timestamp() will consume that newline as part of the leading whitespace, and we may walk past our "tail" pointer (which itself is set from the "size" parameter passed in to parse_commit_buffer()). In practice this can't cause us to walk off the end of an array, because we always add an extra NUL byte to the end of objects we load from disk (as a defense against exactly this kind of bug). However, you can see the behavior in action when "committer" is the final header (which it usually is, unless there's an encoding) and the subject line can be parsed as an integer. We walk right past the newline on the committer line, as well as the "\n\n" separator, and mistake the subject for the timestamp. We can solve this by trimming the whitespace ourselves, making sure that it has some non-whitespace to parse. Note that we need to be a bit careful about the definition of "whitespace" here, as our isspace() doesn't match exotic characters like vertical tab or formfeed. We can work around that by checking for an actual number (see the in-code comment). This is slightly more restrictive than the current code, but in practice the results are either the same (we reject "foo" as "0", but so would parse_timestamp()) or extremely unlikely even for broken commits (parse_timestamp() would allow "\v123" as "123", but we'll now make it "0"). I did also allow "-" here, which may be controversial, as we don't currently support negative timestamps. My reasoning was two-fold. One, the design of parse_timestamp() is such that we should be able to easily switch it to handling signed values, and this otherwise creates a hard-to-find gotcha that anybody doing that work would get tripped up on. And two, the status quo is that we currently parse them, though the result of course ends up as a very large unsigned value (which is likely to just get clamped to "0" for display anyway, since our date routines can't handle it). The new test checks the commit parser (via "--until") for both vanilla spaces and the vertical-tab case. I also added a test to check these against the pretty-print formatter, which uses split_ident_line(). It's not subject to the same bug, because it already insists that there be one or more digits in the timestamp. Helped-by: Phillip Wood Signed-off-by: Jeff King --- commit.c | 28 ++++++++++++++++++++++++++-- t/t4212-log-corrupt.sh | 41 +++++++++++++++++++++++++++++++++++++++++ 2 files changed, 67 insertions(+), 2 deletions(-) diff --git a/commit.c b/commit.c index 04c20d9cc6..8dfe92cf37 100644 --- a/commit.c +++ b/commit.c @@ -121,10 +121,34 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) dateptr = eol; while (dateptr > buf && dateptr[-1] != '>') dateptr--; - if (dateptr == buf || dateptr == eol) + if (dateptr == buf) return 0; - /* dateptr < eol && *eol == '\n', so parsing will stop at eol */ + /* + * Trim leading whitespace, but make sure we have at least one + * non-whitespace character, as parse_timestamp() will otherwise walk + * right past the newline we found in "eol" when skipping whitespace + * itself. + * + * In theory it would be sufficient to allow any character not matched + * by isspace(), but there's a catch: our isspace() does not + * necessarily match the behavior of parse_timestamp(), as the latter + * is implemented by system routines which match more exotic control + * codes, or even locale-dependent sequences. + * + * Since we expect the timestamp to be a number, we can check for that. + * Anything else (e.g., a non-numeric token like "foo") would just + * cause parse_timestamp() to return 0 anyway. + */ + while (dateptr < eol && isspace(*dateptr)) + dateptr++; + if (!isdigit(*dateptr) && *dateptr != '-') + return 0; + + /* + * We know there is at least one digit (or dash), so we'll begin + * parsing there and stop at worst case at eol. + */ return parse_timestamp(dateptr, NULL, 10); } diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh index af4b35ff56..85e90acb09 100755 --- a/t/t4212-log-corrupt.sh +++ b/t/t4212-log-corrupt.sh @@ -92,4 +92,45 @@ test_expect_success 'absurdly far-in-future date' ' git log -1 --format=%ad $commit ' +test_expect_success 'create commits with whitespace committer dates' ' + # It is important that this subject line is numeric, since we want to + # be sure we are not confused by skipping whitespace and accidentally + # parsing the subject as a timestamp. + # + # Do not use munge_author_date here. Besides not hitting the committer + # line, it leaves the timezone intact, and we want nothing but + # whitespace. + # + # We will make two munged commits here. The first, ws_commit, will + # be purely spaces. The second contains a vertical tab, which is + # considered a space by strtoumax(), but not by our isspace(). + test_commit 1234567890 && + git cat-file commit HEAD >commit.orig && + sed "s/>.*/> /" commit.munge && + ws_commit=$(git hash-object --literally -w -t commit commit.munge) && + sed "s/>.*/> $(printf "\013")/" commit.munge && + vt_commit=$(git hash-object --literally -w -t commit commit.munge) +' + +test_expect_success '--until treats whitespace date as sentinel' ' + echo $ws_commit >expect && + git rev-list --until=1980-01-01 $ws_commit >actual && + test_cmp expect actual && + + echo $vt_commit >expect && + git rev-list --until=1980-01-01 $vt_commit >actual && + test_cmp expect actual +' + +test_expect_success 'pretty-printer handles whitespace date' ' + # as with the %ad test above, we will show these as the empty string, + # not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212: + # test bogus timestamps with git-log, 2014-02-24) for more discussion. + echo : >expect && + git log -1 --format="%at:%ct" $ws_commit >actual && + test_cmp expect actual && + git log -1 --format="%at:%ct" $vt_commit >actual && + test_cmp expect actual +' + test_done From patchwork Thu Apr 27 08:17:24 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 13225250 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0DF45C77B73 for ; Thu, 27 Apr 2023 08:17:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242989AbjD0IRa (ORCPT ); Thu, 27 Apr 2023 04:17:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40438 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243023AbjD0IR1 (ORCPT ); Thu, 27 Apr 2023 04:17:27 -0400 Received: from cloud.peff.net (cloud.peff.net [104.130.231.41]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 00BC544A2 for ; Thu, 27 Apr 2023 01:17:25 -0700 (PDT) Received: (qmail 20729 invoked by uid 109); 27 Apr 2023 08:17:25 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 27 Apr 2023 08:17:25 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 17305 invoked by uid 111); 27 Apr 2023 08:17:25 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 27 Apr 2023 04:17:25 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 27 Apr 2023 04:17:24 -0400 From: Jeff King To: Junio C Hamano Cc: Phillip Wood , Thomas Bock , Derrick Stolee , git@vger.kernel.org, =?utf-8?b?UmVuw6k=?= Scharfe Subject: [PATCH v3 4/4] parse_commit(): describe more date-parsing failure modes Message-ID: <20230427081724.GB1478467@coredump.intra.peff.net> References: <20230427081330.GA1461786@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20230427081330.GA1461786@coredump.intra.peff.net> Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org The previous few commits improved the parsing of dates in malformed commit objects. But there's one big case left implicit: we may still feed garbage to parse_timestamp(). This is preferable to trying to be more strict, but let's document the thinking in a comment. Signed-off-by: Jeff King --- commit.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/commit.c b/commit.c index 8dfe92cf37..e2e4fd2db9 100644 --- a/commit.c +++ b/commit.c @@ -148,6 +148,15 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail) /* * We know there is at least one digit (or dash), so we'll begin * parsing there and stop at worst case at eol. + * + * Note that we may feed parse_timestamp() extra characters here if the + * commit is malformed, and it will parse as far as it can. For + * example, "123foo456" would return "123". That might be questionable + * (versus returning "0"), but it would help in a hypothetical case + * like "123456+0100", where the whitespace from the timezone is + * missing. Since such syntactic errors may be baked into history and + * hard to correct now, let's err on trying to make our best guess + * here, rather than insist on perfect syntax. */ return parse_timestamp(dateptr, NULL, 10); }