From patchwork Thu Jun 1 18:03:51 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 13264362 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9FD0EC77B7E for ; Thu, 1 Jun 2023 18:03:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231599AbjFASD5 (ORCPT ); Thu, 1 Jun 2023 14:03:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47408 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229651AbjFASDy (ORCPT ); Thu, 1 Jun 2023 14:03:54 -0400 Received: from cloud.peff.net (cloud.peff.net [104.130.231.41]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2DDB58E for ; Thu, 1 Jun 2023 11:03:53 -0700 (PDT) Received: (qmail 6232 invoked by uid 109); 1 Jun 2023 18:03:52 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 01 Jun 2023 18:03:52 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 25510 invoked by uid 111); 1 Jun 2023 18:03:52 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 01 Jun 2023 14:03:52 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 1 Jun 2023 14:03:51 -0400 From: Jeff King To: git@vger.kernel.org Cc: Teng Long , Junio C Hamano Subject: [PATCH 1/3] ci: use clang for ASan/UBSan checks Message-ID: <20230601180351.GA4167886@coredump.intra.peff.net> References: <20230601180220.GA4167745@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20230601180220.GA4167745@coredump.intra.peff.net> Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Both gcc and clang support the "address" and "undefined" sanitizers. However, they may produce different results. We've seen at least two real world cases where gcc missed a UBSan problem but clang found it: 1. Clang's UBSan (using clang 14.0.6) found a string index that was subtracted to "-1", causing an out-of-bounds read (curiously this didn't trigger ASan, but that may be because the string was in the argv memory, not stack or heap). Using gcc (version 12.2.0) didn't find the same problem. Original thread: https://lore.kernel.org/git/20230519005447.GA2955320@coredump.intra.peff.net/ 2. Clang's UBSan (using clang 4.0.1) complained about pointer arithmetic with NULL, but gcc at the time did not. This was in 2017, and modern gcc does seem to find the issue, though. Original thread: https://lore.kernel.org/git/32a8b949-638a-1784-7fba-948ae32206fc@web.de/ Since we don't otherwise have a particular preference for one compiler over the other for this test, let's switch to the one that we think may be more thorough. Note that it's entirely possible that the two are simply _different_, and we are trading off problems that gcc would find that clang wouldn't. However, my subjective and anecdotal experience has been that clang's sanitizer support is a bit more mature (e.g., I recall other oddities around leak-checking where clang performed more sensibly). Obviously running both and cross-checking the results would give us the best coverage, but that's very expensive to run (and these are already some of our most expensive CI jobs). So let's use clang as our best guess, and we can re-evaluate if we get more data points. Signed-off-by: Jeff King --- You can see that this catches the recent problem when merged with 'next': https://github.com/peff/git/actions/runs/5102807385/jobs/9172605638 .github/workflows/main.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml index 30492eacdd..487ad31e66 100644 --- a/.github/workflows/main.yml +++ b/.github/workflows/main.yml @@ -274,10 +274,10 @@ jobs: cc: gcc pool: ubuntu-latest - jobname: linux-asan - cc: gcc + cc: clang pool: ubuntu-latest - jobname: linux-ubsan - cc: gcc + cc: clang pool: ubuntu-latest env: CC: ${{matrix.vector.cc}} From patchwork Thu Jun 1 18:09:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 13264364 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 09463C77B7A for ; Thu, 1 Jun 2023 18:10:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229603AbjFASKB (ORCPT ); Thu, 1 Jun 2023 14:10:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50020 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231628AbjFASJu (ORCPT ); Thu, 1 Jun 2023 14:09:50 -0400 Received: from cloud.peff.net (cloud.peff.net [104.130.231.41]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 309BD193 for ; Thu, 1 Jun 2023 11:09:49 -0700 (PDT) Received: (qmail 6328 invoked by uid 109); 1 Jun 2023 18:09:48 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 01 Jun 2023 18:09:48 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 25556 invoked by uid 111); 1 Jun 2023 18:09:48 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 01 Jun 2023 14:09:48 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 1 Jun 2023 14:09:47 -0400 From: Jeff King To: git@vger.kernel.org Cc: Teng Long , Junio C Hamano Subject: [PATCH 2/3] ci: run ASan/UBSan in a single job Message-ID: <20230601180947.GB4167886@coredump.intra.peff.net> References: <20230601180220.GA4167745@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20230601180220.GA4167745@coredump.intra.peff.net> Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org When we started running sanitizers in CI via 1c0962c0c4 (ci: add address and undefined sanitizer tasks, 2022-10-20), we ran them as two separate CI jobs, since as that commit notes, the combination "seems to take forever". And indeed, it does with gcc. However, since the previous commit switched to using clang, the situation is different, and we can save some CPU by using a single job for both. Comparing before/after CI runs, this saved about 14 minutes (the single combined job took 54m, versus 44m plus 24m for ASan and UBSan jobs, respectively). That's wall-clock and not CPU, but since our jobs are mostly CPU-bound, the two should be closely proportional. This does increase the end-to-end time of a CI run, though, since before this patch the two jobs could run in parallel, and the sanitizer job is our longest single job. It also means that we won't get a separate result for "this passed with UBSan but not with ASan" or vice versa). But as 1c0962c0c4 noted, that is not a very useful signal in practice. Below are some more detailed timings of gcc vs clang that I measured by running the test suite on my local workstation. Each measurement counts only the time to run the test suite with each compiler (not the compile time itself). We'll focus on the wall-clock times for simplicity, though the CPU times follow roughly similar trends. Here's a run with CC=gcc as a baseline: real 1m12.931s user 9m30.566s sys 8m9.538s Running with SANITIZE=address increases the time by a factor of ~4.7x: real 5m40.352s user 49m37.044s sys 36m42.950s Running with SANITIZE=undefined increases the time by a factor of ~1.7x: real 2m5.956s user 12m42.847s sys 19m27.067s So let's call that 6.4 time units to run them separately (where a unit is the time it takes to run the test suite with no sanitizers). As a simplistic model, we might imagine that running them together would take 5.4 units (we save 1 unit because we are no longer running the test suite twice, but just paying the sanitizer overhead on top of a single run). But that's not what happens. Running with SANITIZE=address,undefined results in a factor of 9.3x: real 11m9.817s user 77m31.284s sys 96m40.454s So not only did we not get faster when doing them together, we actually spent 1.5x as much CPU as doing them separately! And while those wall-clock numbers might not look too terrible, keep in mind that this is on an unloaded 8-core machine. In the CI environment, wall-clock times will be much closer to CPU times. So not only are we wasting CPU, but we risk hitting timeouts. Now let's try the same thing with clang. Here's our no-sanitizer baseline run, which is almost identical to the gcc one (which is quite convenient, because we can keep using the same "time units" to get an apples-to-apples comparison): real 1m11.844s user 9m28.313s sys 8m8.240s And now again with SANITIZE=address, we get a 5x factor (so slightly worse than gcc's 4.7x, though I wouldn't read too much into it; there is a fair bit of run-to-run noise): real 6m7.662s user 49m24.330s sys 44m13.846s And with SANITIZE=undefined, we are at 1.5x, slightly outperforming gcc (though again, that's probably mostly noise): real 1m50.028s user 11m0.973s sys 16m42.731s So running them separately, our total cost is 6.5x. But if we combine them in a single run (SANITIZE=address,undefined), we get: real 6m51.804s user 52m32.049s sys 51m46.711s which is a factor of 5.7x. That's along the lines we'd hoped for! Running them together saves us almost a whole time unit. And that's not counting any time spent outside the test suite itself (starting the job, setting up the environment, compiling) that we're no longer duplicating by having two jobs. So clang behaves like we'd hope: the overhead to run the sanitizers is additive as you add more sanitizers. Whereas gcc's numbers seem very close to multiplicative, almost as if the sanitizers were enforcing their overheads on each other (though that is purely a guess on what is going on; ultimately what matters to us is the amount of time it takes). And that roughly matches the CI improvement I saw. A "time unit" there is more like 12 minutes, and the observed time savings was 14 minutes (with the extra presumably coming from avoiding duplicated setup, etc). Signed-off-by: Jeff King --- I'm not excited about increasing the total run-time. However, if we want to deal with that, I suspect a better solution is to split the tests up into several parallel jobs the way the Windows ones do. That's a lot more work to set up, but it would let us bring the runtime down to match other tests (which seem to take 15-20 minutes, versus 44 minutes even before this patch). .github/workflows/main.yml | 5 +---- ci/lib.sh | 7 ++----- 2 files changed, 3 insertions(+), 9 deletions(-) diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml index 487ad31e66..2114303b7d 100644 --- a/.github/workflows/main.yml +++ b/.github/workflows/main.yml @@ -273,10 +273,7 @@ jobs: - jobname: linux-leaks cc: gcc pool: ubuntu-latest - - jobname: linux-asan - cc: clang - pool: ubuntu-latest - - jobname: linux-ubsan + - jobname: linux-asan-ubsan cc: clang pool: ubuntu-latest env: diff --git a/ci/lib.sh b/ci/lib.sh index db7105e8a8..369d462f13 100755 --- a/ci/lib.sh +++ b/ci/lib.sh @@ -278,11 +278,8 @@ linux-leaks) export GIT_TEST_PASSING_SANITIZE_LEAK=true export GIT_TEST_SANITIZE_LEAK_LOG=true ;; -linux-asan) - export SANITIZE=address - ;; -linux-ubsan) - export SANITIZE=undefined +linux-asan-ubsan) + export SANITIZE=address,undefined ;; esac From patchwork Thu Jun 1 18:10:40 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jeff King X-Patchwork-Id: 13264365 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BA95CC77B7A for ; Thu, 1 Jun 2023 18:10:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229672AbjFASKp (ORCPT ); Thu, 1 Jun 2023 14:10:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50218 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229514AbjFASKn (ORCPT ); Thu, 1 Jun 2023 14:10:43 -0400 Received: from cloud.peff.net (cloud.peff.net [104.130.231.41]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 864EF13D for ; Thu, 1 Jun 2023 11:10:41 -0700 (PDT) Received: (qmail 6356 invoked by uid 109); 1 Jun 2023 18:10:41 -0000 Received: from Unknown (HELO peff.net) (10.0.1.2) by cloud.peff.net (qpsmtpd/0.94) with ESMTP; Thu, 01 Jun 2023 18:10:41 +0000 Authentication-Results: cloud.peff.net; auth=none Received: (qmail 25577 invoked by uid 111); 1 Jun 2023 18:10:40 -0000 Received: from coredump.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.2) by peff.net (qpsmtpd/0.94) with (TLS_AES_256_GCM_SHA384 encrypted) ESMTPS; Thu, 01 Jun 2023 14:10:40 -0400 Authentication-Results: peff.net; auth=none Date: Thu, 1 Jun 2023 14:10:40 -0400 From: Jeff King To: git@vger.kernel.org Cc: Teng Long , Junio C Hamano Subject: [PATCH 3/3] ci: drop linux-clang job Message-ID: <20230601181040.GC4167886@coredump.intra.peff.net> References: <20230601180220.GA4167745@coredump.intra.peff.net> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20230601180220.GA4167745@coredump.intra.peff.net> Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Since the linux-asan-ubsan job runs using clang under Linux, there is not much point in running a separate clang job. Any errors that a normal clang compile-and-test cycle would find are likely to be a subset of what the sanitizer job will find. Since this job takes ~14 minutes to run in CI, this shaves off some of our CPU load (though it does not affect end-to-end runtime, since it's typically run in parallel and is not the longest job). Technically this provides us with slightly less signal for a given run, since you won't immediately know if a failure in the sanitizer job is from using clang or from the sanitizers themselves. But it's generally obvious from the logs, and anyway your next step would be to fix the probvlem and re-run CI, since we expect all of these jobs to pass normally. Signed-off-by: Jeff King --- Obviously not necessary, but just seemed like another easy way to trim our resource usage. Really, the same logic could apply before my series to the linux-gcc job, which is similarly redundant. .github/workflows/main.yml | 3 --- 1 file changed, 3 deletions(-) diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml index 2114303b7d..079645b776 100644 --- a/.github/workflows/main.yml +++ b/.github/workflows/main.yml @@ -246,9 +246,6 @@ jobs: fail-fast: false matrix: vector: - - jobname: linux-clang - cc: clang - pool: ubuntu-latest - jobname: linux-sha256 cc: clang pool: ubuntu-latest