Message ID | 20211122033220.32883-1-chiyutianyi@gmail.com
---|---
Series | unpack large objects in stream
Han Xin <chiyutianyi@gmail.com> writes:

> From: Han Xin <hanxin.hx@alibaba-inc.com>
>
> Although we do not recommend that users push large binary files to Git
> repositories, it's difficult to prevent them from doing so. We once found
> a problem with a surge in memory usage on the server: the source of the
> problem was that a user submitted a single object with a size of 15 GB.
> Once someone initiates a git push, the git process will immediately
> allocate 15 GB of memory, resulting in an OOM risk.
>
> Through further analysis, we found that when we execute git unpack-objects,
> "void *buf = get_data(size);" in unpack_non_delta_entry() will directly
> allocate memory equal to the size of the object. This is quite scary,
> because the pre-receive hook has not been executed at this point, and we
> cannot avoid this with hooks.
>
> I got inspiration from the deflate process of zlib; maybe it would be a
> good idea to change unpack-objects to stream deflate.

Hi, Jeff.

I hope you can share with me how GitHub solves this problem.

As you said in your reply at
https://lore.kernel.org/git/YVaw6agcPNclhws8@coredump.intra.peff.net/:
"we don't have a match in unpack-objects, but we always run index-pack
on incoming packs".

In the original implementation of "index-pack", for objects larger than
big_file_threshold, a "fixed_buf" of 8192 bytes is used to complete the
calculation of the "oid".

I tried the implementation in jk/no-more-unpack-objects, where you noted:

    /* XXX This will expand too-large objects! */
    if (!data)
            data = new_data = get_data_from_pack(obj_entry);

If --unpack is given, there is a risk here. When I create an object
larger than 1 GB and run index-pack, the result is as follows:

    $ GIT_ALLOC_LIMIT=1024m git index-pack --unpack --stdin <large.pack
    fatal: attempting to allocate 1228800001 over limit 1073741824

Looking forward to your reply.
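The streaming idea referred to above amounts to inflating an object's compressed data into a small fixed-size buffer and consuming it chunk by chunk (for hashing, or for writing a loose object), rather than allocating the whole `size` bytes up front as `get_data(size)` does. Below is a minimal sketch of that pattern using plain zlib; the helper name, callback shape, and chunk size are illustrative assumptions, not code from the actual series.

```c
/*
 * Sketch only: inflate an object's compressed data into a fixed-size
 * buffer and hand each inflated chunk to a callback, instead of
 * allocating `size` bytes for the whole object at once.
 */
#include <string.h>
#include <zlib.h>

#define CHUNK 8192

/* called once per inflated chunk, e.g. to update a hash or write output */
typedef void (*consume_fn)(const unsigned char *buf, size_t len, void *data);

static int inflate_in_chunks(const unsigned char *in, size_t in_len,
			     consume_fn consume, void *cb_data)
{
	unsigned char out[CHUNK];
	z_stream zs;
	int ret;

	memset(&zs, 0, sizeof(zs));
	if (inflateInit(&zs) != Z_OK)
		return -1;

	/* real code would also feed the input incrementally from the pack */
	zs.next_in = (unsigned char *)in;
	zs.avail_in = (uInt)in_len;

	do {
		zs.next_out = out;
		zs.avail_out = CHUNK;
		ret = inflate(&zs, Z_NO_FLUSH);
		if (ret != Z_OK && ret != Z_STREAM_END)
			break;	/* corrupt or truncated input */
		consume(out, CHUNK - zs.avail_out, cb_data);
	} while (ret == Z_OK);

	inflateEnd(&zs);
	return ret == Z_STREAM_END ? 0 : -1;
}
```

In an actual unpack-objects implementation the inflated chunks would be fed to Git's hashing and loose-object writing code rather than a generic callback, but the memory profile is the point: it is bounded by the chunk size, not the object size.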
On Mon, Nov 29, 2021 at 03:01:47PM +0800, Han Xin wrote:

> Han Xin <chiyutianyi@gmail.com> writes:
>
> > From: Han Xin <hanxin.hx@alibaba-inc.com>
> >
> > Although we do not recommend that users push large binary files to Git
> > repositories, it's difficult to prevent them from doing so. We once found
> > a problem with a surge in memory usage on the server: the source of the
> > problem was that a user submitted a single object with a size of 15 GB.
> > Once someone initiates a git push, the git process will immediately
> > allocate 15 GB of memory, resulting in an OOM risk.
> >
> > Through further analysis, we found that when we execute git unpack-objects,
> > "void *buf = get_data(size);" in unpack_non_delta_entry() will directly
> > allocate memory equal to the size of the object. This is quite scary,
> > because the pre-receive hook has not been executed at this point, and we
> > cannot avoid this with hooks.
> >
> > I got inspiration from the deflate process of zlib; maybe it would be a
> > good idea to change unpack-objects to stream deflate.
>
> Hi, Jeff.
>
> I hope you can share with me how GitHub solves this problem.
>
> As you said in your reply at
> https://lore.kernel.org/git/YVaw6agcPNclhws8@coredump.intra.peff.net/:
> "we don't have a match in unpack-objects, but we always run index-pack
> on incoming packs".
>
> In the original implementation of "index-pack", for objects larger than
> big_file_threshold, a "fixed_buf" of 8192 bytes is used to complete the
> calculation of the "oid".

We set transfer.unpackLimit to "1", so we never run unpack-objects at
all. We always run index-pack, and every push, no matter how small,
results in a pack.

We also set GIT_ALLOC_LIMIT to limit any single allocation. We also have
custom code in index-pack to detect large objects (where our definition
of "large" is 100MB by default):

  - for large blobs, we index them as normal, writing the oid out to a
    file which is then processed by a pre-receive hook (since people
    often push up large files accidentally, the hook generates a nice
    error message, including finding the path at which the blob is
    referenced)

  - for other large objects, we die immediately (with an error message).
    100MB commit messages aren't a common user error, and it closes off
    a whole set of possible integer-overflow parsing attacks (e.g.,
    index-pack in strict mode will run every tree through fsck_tree(),
    so there's otherwise nothing stopping you from having a 4GB filename
    in a tree).

> I tried the implementation in jk/no-more-unpack-objects, where you noted:
>
>     /* XXX This will expand too-large objects! */
>     if (!data)
>             data = new_data = get_data_from_pack(obj_entry);
>
> If --unpack is given, there is a risk here. When I create an object
> larger than 1 GB and run index-pack, the result is as follows:
>
>     $ GIT_ALLOC_LIMIT=1024m git index-pack --unpack --stdin <large.pack
>     fatal: attempting to allocate 1228800001 over limit 1073741824

Yeah, that issue was one of the reasons I never sent the "index-pack
--unpack" code to the list. We don't actually use those patches at
GitHub. It was something I was working on for upstream but never
finished.

-Peff
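The two stock knobs Jeff mentions can be set on any server; the snippet below is only a sketch of that part of the setup (the GIT_ALLOC_LIMIT value is an arbitrary example, and the custom large-object checks in index-pack are GitHub-specific code, so they are not shown).

```sh
# Per repository (or in the system/global config): force every push
# through index-pack and keep it as a pack, so unpack-objects (and its
# whole-object allocation) never runs on the receiving side.
git config transfer.unpackLimit 1

# In the environment of the processes that service pushes: cap any
# single allocation a git process may make (example value).
export GIT_ALLOC_LIMIT=512m
```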
On Tue, Nov 30, 2021 at 3:12 AM Jeff King <peff@peff.net> wrote:

> We set transfer.unpackLimit to "1", so we never run unpack-objects at
> all. We always run index-pack, and every push, no matter how small,
> results in a pack.
>
> We also set GIT_ALLOC_LIMIT to limit any single allocation. We also have
> custom code in index-pack to detect large objects (where our definition
> of "large" is 100MB by default):
>
>   - for large blobs, we index them as normal, writing the oid out to a
>     file which is then processed by a pre-receive hook (since people
>     often push up large files accidentally, the hook generates a nice
>     error message, including finding the path at which the blob is
>     referenced)
>
>   - for other large objects, we die immediately (with an error message).
>     100MB commit messages aren't a common user error, and it closes off
>     a whole set of possible integer-overflow parsing attacks (e.g.,
>     index-pack in strict mode will run every tree through fsck_tree(),
>     so there's otherwise nothing stopping you from having a 4GB filename
>     in a tree).

Thank you very much for sharing. The way GitHub handles it reminds me of
what Shawn Pearce introduced in "Scaling Up JGit". I guess
"multi-pack-index" and "bitmap" must play an important role in this. I
will seriously consider this solution. Thanks a lot.