Message ID | cover.1699381371.git.me@ttaylorr.com (mailing list archive) |
---|---|
Headers | show |
Series | replay: implement support for writing new objects to a pack | expand |
On Tue, Nov 7, 2023 at 10:22 AM Taylor Blau <me@ttaylorr.com> wrote: > > (Based on a combination of Christian's cc/git-replay and my > tb/merge-tree-write-pack branches). > > This RFC demonstrates extending the new `--write-pack` option that > `merge-tree` recently learned to the `replay` builtin as well. > > The approach is as follows: > > - write a pack out after each step in the replay operation, so that > subsequent steps may see any new object(s) created during previous > steps > > - combine those packs into one before migrating them back into the > main object store > > This is accomplished with a combination of the bulk-checkin and > tmp-objdir APIs, with some minor modifications made to when we flush out > and finalize bulk-checkin transactions. > > The benefit to this approach is that we bound the number of inodes > required per replayed commit to a constant (by default, 3: one for the > .pack, one for the .idx, and another for the .rev file), instead of > having each operation take an unbounded number of inodes proportional to > the number of new objects created during that step. We also only migrate > a single pack back to the main object store. Isn't it actually 4? Since you only put blobs and trees into the bulk checkin packfiles, the commit object will still be loose. > In other words, this makes the maximum number of inodes required by > 'replay' grow proportional to the number of commits being replayed, > instead of the number of new *objects* created as a result of the replay > operation. As per comments on the other series, we actually need 2 packs per commit (when replaying non-merge commits, we don't have to worry about recursive merges, so 2*depth is just 2), so this would be 7 inodes per commit (3 files per pack * 2 packs + 1 loose commit object). I was curious what the comparative number of loose objects might be if we didn't use the bulk checking, so: $ cd linux.git $ git rev-list --no-merges --count HEAD 1141436 $ time git log --oneline --no-merges --name-only | wc -l 3781628 $ python -c 'print(3781628/1141436)' 3.3130442705504293 So, if repositories are similar to linux, that's a bit over 3 files modified per commit. It's a bit harder to count trees, but let's take a wild guess and say that each file is 7 directories (because I'm too lazy to do real research and 7 happens to give me nice round numbers later), so that's up to 7*3 tree objects. So 3 file objects + 21 tree objects + 1 commit object = 25 loose objects. So, your scheme certainly seems to reduce number of inodes, but does it introduce a different scaling issue? git-gc repacks when there are >= 50 packs, or when there are >= 6700 loose objects, suggesting that if we exceed those numbers, we might start seeing performance suffer. We would hit 50 packs with a mere 25 commits being replayed, and wouldn't expect to get to 6700 loose objects until we replayed about 268 commits (6700/25). Does this mean we are risking worse performance degradation with this scheme than with just using loose objects until the end of the operation?
On Tue, Nov 7, 2023 at 10:22 AM Taylor Blau <me@ttaylorr.com> wrote: > > (Based on a combination of Christian's cc/git-replay and my > tb/merge-tree-write-pack branches). > > This RFC demonstrates extending the new `--write-pack` option that > `merge-tree` recently learned to the `replay` builtin as well. > > The approach is as follows: > > - write a pack out after each step in the replay operation, so that > subsequent steps may see any new object(s) created during previous > steps > > - combine those packs into one before migrating them back into the > main object store > > This is accomplished with a combination of the bulk-checkin and > tmp-objdir APIs, with some minor modifications made to when we flush out > and finalize bulk-checkin transactions. Just thought I'd note that I finished reading all of this series as well as the five-commit series it's based on. Just wanted to note that any commits I didn't comment on from either series looked good to me.