Message ID | 20240228223907.GI1158131@coredump.intra.peff.net
---|---
State | Accepted
Commit | 6cd05e768b7e54ca48b16fb0214df4c70aecd46c
Series | bound upload-pack memory allocations
On Wed, Feb 28, 2024 at 05:39:07PM -0500, Jeff King wrote:

> When a client sends us a "want" or "have" line, we call parse_object() to get an object struct. If the object is a tree, then the parsed state means that tree->buffer points to the uncompressed contents of the tree. But we don't really care about it. We only really need to parse commits and tags; for trees and blobs, the important output is just a "struct object" with the correct type.
>
> But much worse, we do not ever free that tree buffer. It's not leaked in the traditional sense, in that we still have a pointer to it from the global object hash. But if the client requests many trees, we'll hold all of their contents in memory at the same time.
>
> Nobody really noticed because it's rare for clients to directly request a tree. It might happen for a lightweight tag pointing straight at a tree, or it might happen for a "tree:depth" partial clone filling in missing trees.
>
> But it's also possible for a malicious client to request a lot of trees, causing upload-pack's memory to balloon. For example, without this patch, requesting every tree in git.git like:
>
>   pktline() {
>     local msg="$*"
>     printf "%04x%s\n" $((1+4+${#msg})) "$msg"
>   }
>
>   want_trees() {
>     pktline command=fetch
>     printf 0001
>     git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' |
>     while read oid type; do
>       test "$type" = "tree" || continue
>       pktline want $oid
>     done
>     pktline done
>     printf 0000
>   }
>
>   want_trees | GIT_PROTOCOL=version=2 valgrind --tool=massif ./git upload-pack . >/dev/null
>
> shows a peak heap usage of ~3.7GB. Which is just about the sum of the sizes of all of the uncompressed trees. For linux.git, it's closer to 17GB.
>
> So the obvious thing to do is to call free_tree_buffer() after we realize that we've parsed a tree. We know that upload-pack won't need it later. But let's push the logic into parse_object_with_flags(), telling it to discard the tree buffer immediately. There are two reasons for this. One, all of the relevant call-sites already call the with_options variant to pass the SKIP_HASH flag. So it actually ends up as less code than manually free-ing in each spot. And two, it enables an extra optimization that I'll discuss below.
>
> I've touched all of the sites that currently use SKIP_HASH in upload-pack. That drops the peak heap of the upload-pack invocation above from 3.7GB to ~24MB.
>
> I've also modified the caller in get_reference(); a partial clone benefits from its use in pack-objects for the reasons given in 0bc2557951 (upload-pack: skip parse-object re-hashing of "want" objects, 2022-09-06), where we were measuring blob requests. But note that the results of get_reference() are used for traversing, as well; so we really would _eventually_ use the tree contents. That makes this at first glance a space/time tradeoff: we won't hold all of the trees in memory at once, but we'll have to reload them each when it comes time to traverse.
>
> And here's where our extra optimization comes in. If the caller is not going to immediately look at the tree contents, and it doesn't care about checking the hash, then parse_object() can simply skip loading the tree entirely, just like we do for blobs! And now it's not a space/time tradeoff in get_reference() anymore. It's just a lazy-load: we're delaying reading the tree contents until it's time to actually traverse them one by one.
>
> And of course for upload-pack, this optimization means we never load the trees at all, saving lots of CPU time. Timing the "every tree from git.git" request above shows upload-pack dropping from 32 seconds of CPU to 19 (the remainder is mostly due to pack-objects actually sending the pack; timing just the upload-pack portion shows we go from 13s to ~0.28s).
>
> These are all highly gamed numbers, of course. For real-world partial-clone requests we're saving only a small bit of time in practice. But it does help harden upload-pack against malicious denial-of-service attacks.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  object.c      | 14 ++++++++++++++
>  object.h      |  1 +
>  revision.c    |  3 ++-
>  upload-pack.c |  9 ++++++---
>  4 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/object.c b/object.c
> index e6a1c4d905..f11c59ac0c 100644
> --- a/object.c
> +++ b/object.c
> @@ -271,6 +271,7 @@ struct object *parse_object_with_flags(struct repository *r,
>  				       enum parse_object_flags flags)
>  {
>  	int skip_hash = !!(flags & PARSE_OBJECT_SKIP_HASH_CHECK);
> +	int discard_tree = !!(flags & PARSE_OBJECT_DISCARD_TREE);
>  	unsigned long size;
>  	enum object_type type;
>  	int eaten;
> @@ -298,6 +299,17 @@ struct object *parse_object_with_flags(struct repository *r,
>  		return lookup_object(r, oid);
>  	}
>
> +	/*
> +	 * If the caller does not care about the tree buffer and does not
> +	 * care about checking the hash, we can simply verify that we
> +	 * have the on-disk object with the correct type.
> +	 */
> +	if (skip_hash && discard_tree &&
> +	    (!obj || obj->type == OBJ_TREE) &&
> +	    oid_object_info(r, oid, NULL) == OBJ_TREE) {
> +		return &lookup_tree(r, oid)->object;
> +	}

The other condition for blobs does the same, but the condition here confuses me. Why do we call `oid_object_info()` if we have already figured out that `obj->type == OBJ_TREE`? Feels like wasted effort if the in-memory object has been determined to be a tree already anyway.

I'd rather have expected it to look like the following:

    if (skip_hash && discard_tree &&
        ((obj && obj->type == OBJ_TREE) ||
         (!obj && oid_object_info(r, oid, NULL)) == OBJ_TREE)) {
            return &lookup_tree(r, oid)->object;
    }

Am I missing some side effect that `oid_object_info()` provides?

Patrick

> +
>  	buffer = repo_read_object_file(r, oid, &type, &size);
>  	if (buffer) {
>  		if (!skip_hash &&
> @@ -311,6 +323,8 @@ struct object *parse_object_with_flags(struct repository *r,
>  					     buffer, &eaten);
>  		if (!eaten)
>  			free(buffer);
> +		if (discard_tree && type == OBJ_TREE)
> +			free_tree_buffer((struct tree *)obj);
>  		return obj;
>  	}
>  	return NULL;
> diff --git a/object.h b/object.h
> index 114d45954d..c7123cade6 100644
> --- a/object.h
> +++ b/object.h
> @@ -197,6 +197,7 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet);
>   */
>  enum parse_object_flags {
>  	PARSE_OBJECT_SKIP_HASH_CHECK = 1 << 0,
> +	PARSE_OBJECT_DISCARD_TREE = 1 << 1,
>  };
>  struct object *parse_object(struct repository *r, const struct object_id *oid);
>  struct object *parse_object_with_flags(struct repository *r,
> diff --git a/revision.c b/revision.c
> index 2424c9bd67..b10f63a607 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -381,7 +381,8 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
>
>  	object = parse_object_with_flags(revs->repo, oid,
>  					 revs->verify_objects ? 0 :
> -					 PARSE_OBJECT_SKIP_HASH_CHECK);
> +					 PARSE_OBJECT_SKIP_HASH_CHECK |
> +					 PARSE_OBJECT_DISCARD_TREE);
>
>  	if (!object) {
>  		if (revs->ignore_missing)
> diff --git a/upload-pack.c b/upload-pack.c
> index b721155442..761af4a532 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -470,7 +470,8 @@ static int do_got_oid(struct upload_pack_data *data, const struct object_id *oid
>  {
>  	int we_knew_they_have = 0;
>  	struct object *o = parse_object_with_flags(the_repository, oid,
> -						   PARSE_OBJECT_SKIP_HASH_CHECK);
> +						   PARSE_OBJECT_SKIP_HASH_CHECK |
> +						   PARSE_OBJECT_DISCARD_TREE);
>
>  	if (!o)
>  		die("oops (%s)", oid_to_hex(oid));
> @@ -1150,7 +1151,8 @@ static void receive_needs(struct upload_pack_data *data,
>  		}
>
>  		o = parse_object_with_flags(the_repository, &oid_buf,
> -					    PARSE_OBJECT_SKIP_HASH_CHECK);
> +					    PARSE_OBJECT_SKIP_HASH_CHECK |
> +					    PARSE_OBJECT_DISCARD_TREE);
>  		if (!o) {
>  			packet_writer_error(&data->writer,
>  					    "upload-pack: not our ref %s",
> @@ -1467,7 +1469,8 @@ static int parse_want(struct packet_writer *writer, const char *line,
>  			    "expected to get oid, not '%s'", line);
>
>  		o = parse_object_with_flags(the_repository, &oid,
> -					    PARSE_OBJECT_SKIP_HASH_CHECK);
> +					    PARSE_OBJECT_SKIP_HASH_CHECK |
> +					    PARSE_OBJECT_DISCARD_TREE);
>
>  		if (!o) {
>  			packet_writer_error(writer,
> --
> 2.44.0.rc2.424.gbdbf4d014b
On Mon, Mar 04, 2024 at 09:33:57AM +0100, Patrick Steinhardt wrote:

> > +	if (skip_hash && discard_tree &&
> > +	    (!obj || obj->type == OBJ_TREE) &&
> > +	    oid_object_info(r, oid, NULL) == OBJ_TREE) {
> > +		return &lookup_tree(r, oid)->object;
> > +	}
>
> The other condition for blobs does the same, but the condition here confuses me. Why do we call `oid_object_info()` if we have already figured out that `obj->type == OBJ_TREE`? Feels like wasted effort if the in-memory object has been determined to be a tree already anyway.
>
> I'd rather have expected it to look like the following:
>
>     if (skip_hash && discard_tree &&
>         ((obj && obj->type == OBJ_TREE) ||
>          (!obj && oid_object_info(r, oid, NULL)) == OBJ_TREE)) {
>             return &lookup_tree(r, oid)->object;
>     }
>
> Am I missing some side effect that `oid_object_info()` provides?

Calling oid_object_info() will make sure the on-disk object exists and has the expected type. Keep in mind that an in-memory "struct object" may have a type that was just implied by another reference. E.g., if a commit references some object X in its tree field, then we'll call lookup_tree(X) to get a "struct tree" without actually touching the odb at all. When it comes time to parse that object, that's when we'll see if we really have it and if it's a tree.

In the case of skip_hash (and discard_tree) it might be OK to skip both of those checks. If we do, I think we should probably do the same for blobs (in the skip_hash case, we could just return the object we found already).

But I'd definitely prefer to do that as a separate step (if at all).

-Peff
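[Editor's note] Jeff's distinction between an implied in-memory type and the verified on-disk type can be sketched with a small toy model. This is illustrative Python only, not git's C code; `odb`, `lookup`, and `parse` are invented names standing in for the object database, `lookup_tree()`-style lookups, and `parse_object()` respectively:

```python
# Toy model of git's in-memory object hash (illustrative only; not git code).
# lookup() records a type merely *implied* by a referencing object, without
# touching the "odb"; parse() is the first point where the on-disk type is
# actually checked, which is what oid_object_info() provides in the patch.

odb = {"abc123": "blob"}   # what is actually on disk
obj_hash = {}              # in-memory structs, keyed by oid

def lookup(oid, implied_type):
    """Create or find an in-memory struct; no disk access happens here."""
    return obj_hash.setdefault(
        oid, {"oid": oid, "type": implied_type, "parsed": False})

def parse(oid):
    """Verify the object exists on disk and has the implied type."""
    obj = obj_hash[oid]
    on_disk = odb.get(oid)
    if on_disk is None:
        raise KeyError(f"missing object {oid}")
    if on_disk != obj["type"]:
        raise TypeError(f"object {oid} is a {on_disk}, not a {obj['type']}")
    obj["parsed"] = True
    return obj

# A referencing object implies "abc123" is a tree; lookup succeeds blindly...
t = lookup("abc123", "tree")
# ...but parsing notices that the on-disk object is really a blob.
try:
    parse("abc123")
except TypeError as e:
    print(e)
```

This is why trusting `obj->type` alone, as in Patrick's suggested condition, would skip the existence and type verification that `oid_object_info()` performs.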
On Mon, Mar 04, 2024 at 04:57:36AM -0500, Jeff King wrote:

> On Mon, Mar 04, 2024 at 09:33:57AM +0100, Patrick Steinhardt wrote:
>
> > > +	if (skip_hash && discard_tree &&
> > > +	    (!obj || obj->type == OBJ_TREE) &&
> > > +	    oid_object_info(r, oid, NULL) == OBJ_TREE) {
> > > +		return &lookup_tree(r, oid)->object;
> > > +	}
> >
> > The other condition for blobs does the same, but the condition here confuses me. Why do we call `oid_object_info()` if we have already figured out that `obj->type == OBJ_TREE`? Feels like wasted effort if the in-memory object has been determined to be a tree already anyway.
> >
> > I'd rather have expected it to look like the following:
> >
> >     if (skip_hash && discard_tree &&
> >         ((obj && obj->type == OBJ_TREE) ||
> >          (!obj && oid_object_info(r, oid, NULL)) == OBJ_TREE)) {
> >             return &lookup_tree(r, oid)->object;
> >     }
> >
> > Am I missing some side effect that `oid_object_info()` provides?
>
> Calling oid_object_info() will make sure the on-disk object exists and has the expected type. Keep in mind that an in-memory "struct object" may have a type that was just implied by another reference. E.g., if a commit references some object X in its tree field, then we'll call lookup_tree(X) to get a "struct tree" without actually touching the odb at all. When it comes time to parse that object, that's when we'll see if we really have it and if it's a tree.
>
> In the case of skip_hash (and discard_tree) it might be OK to skip both of those checks. If we do, I think we should probably do the same for blobs (in the skip_hash case, we could just return the object we found already).
>
> But I'd definitely prefer to do that as a separate step (if at all).

Thanks for the explanation!

Patrick
diff --git a/object.c b/object.c
index e6a1c4d905..f11c59ac0c 100644
--- a/object.c
+++ b/object.c
@@ -271,6 +271,7 @@ struct object *parse_object_with_flags(struct repository *r,
 				       enum parse_object_flags flags)
 {
 	int skip_hash = !!(flags & PARSE_OBJECT_SKIP_HASH_CHECK);
+	int discard_tree = !!(flags & PARSE_OBJECT_DISCARD_TREE);
 	unsigned long size;
 	enum object_type type;
 	int eaten;
@@ -298,6 +299,17 @@ struct object *parse_object_with_flags(struct repository *r,
 		return lookup_object(r, oid);
 	}
 
+	/*
+	 * If the caller does not care about the tree buffer and does not
+	 * care about checking the hash, we can simply verify that we
+	 * have the on-disk object with the correct type.
+	 */
+	if (skip_hash && discard_tree &&
+	    (!obj || obj->type == OBJ_TREE) &&
+	    oid_object_info(r, oid, NULL) == OBJ_TREE) {
+		return &lookup_tree(r, oid)->object;
+	}
+
 	buffer = repo_read_object_file(r, oid, &type, &size);
 	if (buffer) {
 		if (!skip_hash &&
@@ -311,6 +323,8 @@ struct object *parse_object_with_flags(struct repository *r,
 					     buffer, &eaten);
 		if (!eaten)
 			free(buffer);
+		if (discard_tree && type == OBJ_TREE)
+			free_tree_buffer((struct tree *)obj);
 		return obj;
 	}
 	return NULL;
diff --git a/object.h b/object.h
index 114d45954d..c7123cade6 100644
--- a/object.h
+++ b/object.h
@@ -197,6 +197,7 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet);
  */
 enum parse_object_flags {
 	PARSE_OBJECT_SKIP_HASH_CHECK = 1 << 0,
+	PARSE_OBJECT_DISCARD_TREE = 1 << 1,
 };
 struct object *parse_object(struct repository *r, const struct object_id *oid);
 struct object *parse_object_with_flags(struct repository *r,
diff --git a/revision.c b/revision.c
index 2424c9bd67..b10f63a607 100644
--- a/revision.c
+++ b/revision.c
@@ -381,7 +381,8 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 
 	object = parse_object_with_flags(revs->repo, oid,
 					 revs->verify_objects ? 0 :
-					 PARSE_OBJECT_SKIP_HASH_CHECK);
+					 PARSE_OBJECT_SKIP_HASH_CHECK |
+					 PARSE_OBJECT_DISCARD_TREE);
 
 	if (!object) {
 		if (revs->ignore_missing)
diff --git a/upload-pack.c b/upload-pack.c
index b721155442..761af4a532 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -470,7 +470,8 @@ static int do_got_oid(struct upload_pack_data *data, const struct object_id *oid
 {
 	int we_knew_they_have = 0;
 	struct object *o = parse_object_with_flags(the_repository, oid,
-						   PARSE_OBJECT_SKIP_HASH_CHECK);
+						   PARSE_OBJECT_SKIP_HASH_CHECK |
+						   PARSE_OBJECT_DISCARD_TREE);
 
 	if (!o)
 		die("oops (%s)", oid_to_hex(oid));
@@ -1150,7 +1151,8 @@ static void receive_needs(struct upload_pack_data *data,
 		}
 
 		o = parse_object_with_flags(the_repository, &oid_buf,
-					    PARSE_OBJECT_SKIP_HASH_CHECK);
+					    PARSE_OBJECT_SKIP_HASH_CHECK |
+					    PARSE_OBJECT_DISCARD_TREE);
 		if (!o) {
 			packet_writer_error(&data->writer,
 					    "upload-pack: not our ref %s",
@@ -1467,7 +1469,8 @@ static int parse_want(struct packet_writer *writer, const char *line,
 			    "expected to get oid, not '%s'", line);
 
 		o = parse_object_with_flags(the_repository, &oid,
-					    PARSE_OBJECT_SKIP_HASH_CHECK);
+					    PARSE_OBJECT_SKIP_HASH_CHECK |
+					    PARSE_OBJECT_DISCARD_TREE);
 
 		if (!o) {
 			packet_writer_error(writer,
When a client sends us a "want" or "have" line, we call parse_object() to get an object struct. If the object is a tree, then the parsed state means that tree->buffer points to the uncompressed contents of the tree. But we don't really care about it. We only really need to parse commits and tags; for trees and blobs, the important output is just a "struct object" with the correct type.

But much worse, we do not ever free that tree buffer. It's not leaked in the traditional sense, in that we still have a pointer to it from the global object hash. But if the client requests many trees, we'll hold all of their contents in memory at the same time.

Nobody really noticed because it's rare for clients to directly request a tree. It might happen for a lightweight tag pointing straight at a tree, or it might happen for a "tree:depth" partial clone filling in missing trees.

But it's also possible for a malicious client to request a lot of trees, causing upload-pack's memory to balloon. For example, without this patch, requesting every tree in git.git like:

  pktline() {
    local msg="$*"
    printf "%04x%s\n" $((1+4+${#msg})) "$msg"
  }

  want_trees() {
    pktline command=fetch
    printf 0001
    git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' |
    while read oid type; do
      test "$type" = "tree" || continue
      pktline want $oid
    done
    pktline done
    printf 0000
  }

  want_trees | GIT_PROTOCOL=version=2 valgrind --tool=massif ./git upload-pack . >/dev/null

shows a peak heap usage of ~3.7GB. Which is just about the sum of the sizes of all of the uncompressed trees. For linux.git, it's closer to 17GB.

So the obvious thing to do is to call free_tree_buffer() after we realize that we've parsed a tree. We know that upload-pack won't need it later. But let's push the logic into parse_object_with_flags(), telling it to discard the tree buffer immediately. There are two reasons for this. One, all of the relevant call-sites already call the with_options variant to pass the SKIP_HASH flag. So it actually ends up as less code than manually free-ing in each spot. And two, it enables an extra optimization that I'll discuss below.

I've touched all of the sites that currently use SKIP_HASH in upload-pack. That drops the peak heap of the upload-pack invocation above from 3.7GB to ~24MB.

I've also modified the caller in get_reference(); a partial clone benefits from its use in pack-objects for the reasons given in 0bc2557951 (upload-pack: skip parse-object re-hashing of "want" objects, 2022-09-06), where we were measuring blob requests. But note that the results of get_reference() are used for traversing, as well; so we really would _eventually_ use the tree contents. That makes this at first glance a space/time tradeoff: we won't hold all of the trees in memory at once, but we'll have to reload them each when it comes time to traverse.

And here's where our extra optimization comes in. If the caller is not going to immediately look at the tree contents, and it doesn't care about checking the hash, then parse_object() can simply skip loading the tree entirely, just like we do for blobs! And now it's not a space/time tradeoff in get_reference() anymore. It's just a lazy-load: we're delaying reading the tree contents until it's time to actually traverse them one by one.

And of course for upload-pack, this optimization means we never load the trees at all, saving lots of CPU time. Timing the "every tree from git.git" request above shows upload-pack dropping from 32 seconds of CPU to 19 (the remainder is mostly due to pack-objects actually sending the pack; timing just the upload-pack portion shows we go from 13s to ~0.28s).

These are all highly gamed numbers, of course. For real-world partial-clone requests we're saving only a small bit of time in practice. But it does help harden upload-pack against malicious denial-of-service attacks.

Signed-off-by: Jeff King <peff@peff.net>
---
 object.c      | 14 ++++++++++++++
 object.h      |  1 +
 revision.c    |  3 ++-
 upload-pack.c |  9 ++++++---
 4 files changed, 23 insertions(+), 4 deletions(-)
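[Editor's note] The pkt-line framing used by the reproduction script above can be mirrored in Python. The 4-hex-digit length prefix covers the length field itself, the payload, and the trailing newline, which is exactly the shell helper's `$((1+4+${#msg}))`. A minimal sketch, not git's implementation:

```python
def pktline(msg: str) -> str:
    """Frame one protocol v2 pkt-line: a 4-hex-digit length that counts
    the length field (4), the payload, and the trailing newline (1),
    matching the shell helper's $((1+4+${#msg}))."""
    return "%04x%s\n" % (4 + len(msg) + 1, msg)

# "command=fetch" is 13 bytes; 4 + 13 + 1 = 18 = 0x12.
print(pktline("command=fetch"), end="")
```

Special packets like the delimiter `0001` and the flush `0000` are not length-prefixed payloads, which is why the script emits them with bare `printf` rather than through `pktline`.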