Message ID | CABURp0r_VPNeU5ugbJDQo=hV_aOit7W-pf9nhDF9Og=vVJG8tw@mail.gmail.com (mailing list archive)
---|---
State | New, archived
Series | RFC - concurrency causes segfault in git grep since 2.26.0
Hi, Phil

Thanks for the bug report and for the crash logs.

On Thu, Sep 24, 2020 at 11:36 PM Phil Hord <phil.hord@gmail.com> wrote:
>
> Hi list. It's been a while.
>
> I can reliably reproduce a segfault in git grep when searching the
> object DB. It is caused by a race when threads > 2.
>
> I have a patch that fixes the problem, but I don't know exactly why.
> Can someone else explain it and/or offer a better solution?
>
> ====
>
> diff --git packfile.c packfile.c
> index e69012e7f2..52b7b54aeb 100644
> --- packfile.c
> +++ packfile.c
> @@ -1812,7 +1812,9 @@ void *unpack_entry(struct repository *r, struct
> packed_git *p, off_t obj_offset,
>                         if (!base)
>                                 continue;
>
> +                       obj_read_lock();
>                         delta_data = unpack_compressed_entry(p, &w_curs,
>                                                              curpos, delta_size);
> +                       obj_read_unlock();
>
>                         if (!delta_data) {
>                                 error("failed to unpack compressed delta "
>
> ====

Hmm, obj_read_mutex is a recursive lock, so by adding the
obj_read_lock() call here, we are incrementing the lock value to [at
least] 2 (because oid_object_info_extended() had already incremented it
once). Therefore, when unpack_compressed_entry() calls obj_read_unlock()
before inflating the entry, the lock is not actually released, only
decremented. So the effect of this change is that phase III of
unpack_entry() is performed entirely without releasing the lock.

Your crash log shows us that the segfault happened when trying to
memcpy() the string `base` (in unpack_entry()), i.e., the same string
that we had previously added to the delta base cache, right before
calling unpack_compressed_entry() and releasing the lock. The
problematic sequence is:

add `base` to the cache -> release lock -> inflate ->
acquire lock -> try to use `base`

Maybe, when a thread X releases the lock to perform decompression,
thread Y acquires the lock and tries to add another base to the cache.
The cache is full, so Y has to remove entries, and it ends up free()'ing
the base that was just inserted by X. Later, X tries to dereference
`base` in patch_entry(), which would cause a segfault. It would also
explain why your change solves the segfault.

I'm not sure, though, why this entry would be removed, since it was just
added... Maybe Y was adding a huge base to the cache, which required
removing all entries?

Anyways, you mentioned you can reproduce the failure quite reliably in
your repo with 20 threads, right? Could you please check with the
following patch applied? It adds a copy of `base` to the cache instead
of the original string, allowing us to keep running decompression in
parallel but without the risk of having another thread free() `base` in
the meantime.

diff --git a/packfile.c b/packfile.c
index 9ef27508f2..79f83b6034 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1772,14 +1772,15 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 	while (delta_stack_nr) {
 		void *delta_data;
 		void *base = data;
-		void *external_base = NULL;
 		unsigned long delta_size, base_size = size;
 		int i;
 
 		data = NULL;
 
-		if (base)
-			add_delta_base_cache(p, obj_offset, base, base_size, type);
+		if (base) {
+			char *base_dup = memcpy(xmalloc(base_size), base, base_size);
+			add_delta_base_cache(p, obj_offset, base_dup, base_size, type);
+		}
 
 		if (!base) {
 			/*
@@ -1799,7 +1800,6 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 				      p->pack_name);
 				mark_bad_packed_object(p, base_oid.hash);
 				base = read_object(r, &base_oid, &type, &base_size);
-				external_base = base;
 			}
 		}
 
@@ -1818,7 +1818,7 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 			      "at offset %"PRIuMAX" from %s",
 			      (uintmax_t)curpos, p->pack_name);
 			data = NULL;
-			free(external_base);
+			free(base);
 			continue;
 		}
 
@@ -1838,7 +1838,7 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
 			error("failed to apply delta");
 
 		free(delta_data);
-		free(external_base);
+		free(base);
 	}
 
 	if (final_type)
--
Hi Matheus,

Thanks for the quick response.

On Thu, Sep 24, 2020 at 10:53 PM Matheus Tavares <matheus.bernardino@usp.br> wrote:
>
> Hi, Phil
>
> Thanks for the bug report and for the crash logs.
>
> On Thu, Sep 24, 2020 at 11:36 PM Phil Hord <phil.hord@gmail.com> wrote:
> >
> > Hi list. It's been a while.
> >
> > I can reliably reproduce a segfault in git grep when searching the
> > object DB. It is caused by a race when threads > 2.
> >
> > I have a patch that fixes the problem, but I don't know exactly why.
> > Can someone else explain it and/or offer a better solution?
> >
> > ====
> >
> > diff --git packfile.c packfile.c
> > index e69012e7f2..52b7b54aeb 100644
> > --- packfile.c
> > +++ packfile.c
> > @@ -1812,7 +1812,9 @@ void *unpack_entry(struct repository *r, struct
> > packed_git *p, off_t obj_offset,
> >                         if (!base)
> >                                 continue;
> >
> > +                       obj_read_lock();
> >                         delta_data = unpack_compressed_entry(p, &w_curs,
> >                                                              curpos, delta_size);
> > +                       obj_read_unlock();
> >
> >                         if (!delta_data) {
> >                                 error("failed to unpack compressed delta "
> >
> > ====
>
> Hmm, obj_read_mutex is a recursive lock, so by adding the
> obj_read_lock() call here, we are incrementing the lock value to [at
> least] 2 (because oid_object_info_extended() had already incremented it
> once). Therefore, when unpack_compressed_entry() calls obj_read_unlock()
> before inflating the entry, the lock is not actually released, only
> decremented. So the effect of this change is that phase III of
> unpack_entry() is performed entirely without releasing the lock.

Ah, there's the piece I was missing. I didn't realize we temporarily
drop the lock inside unpack_compressed_entry(). And obviously I also
didn't notice that we were already holding a lock before entering
unpack_entry(). Thanks for the detailed explanation.

> Your crash log shows us that the segfault happened when trying to
> memcpy() the string `base` (in unpack_entry()), i.e., the same string
> that we had previously added to the delta base cache, right before
> calling unpack_compressed_entry() and releasing the lock. The
> problematic sequence is:
>
> add `base` to the cache -> release lock -> inflate ->
> acquire lock -> try to use `base`
>
> Maybe, when a thread X releases the lock to perform decompression,
> thread Y acquires the lock and tries to add another base to the cache.
> The cache is full, so Y has to remove entries, and it ends up free()'ing
> the base that was just inserted by X. Later, X tries to dereference
> `base` in patch_entry(), which would cause a segfault. It would also
> explain why your change solves the segfault.
>
> I'm not sure, though, why this entry would be removed, since it was just
> added... Maybe Y was adding a huge base to the cache, which required
> removing all entries?

I think your theory is correct here on both counts. The repo I am
searching has some sinfully large objects, and when there is a cache
limit, they are likely to exceed it.

> Anyways, you mentioned you can reproduce the failure quite reliably in
> your repo with 20 threads, right? Could you please check with the
> following patch applied? It adds a copy of `base` to the cache instead
> of the original string, allowing us to keep running decompression in
> parallel but without the risk of having another thread free() `base` in
> the meantime.

Yes, your patch reliably solves the problem, too. The performance is
only slightly improved over holding the lock during inflate in my case,
but it does save a bit of time.

Surprisingly, performance seems to peak between 5 and 10 threads for me
in both cases. It reliably gets slightly worse at 20 threads and at 40
threads. In all cases when threads > 2, this search appears to be
10%-20% slower than running without this change. Even when threads=1,
though, the result is about 10% slower. Perhaps it is worth avoiding the
base_dup when !!obj_read_use_lock.
> diff --git a/packfile.c b/packfile.c
> index 9ef27508f2..79f83b6034 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -1772,14 +1772,15 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
>  	while (delta_stack_nr) {
>  		void *delta_data;
>  		void *base = data;
> -		void *external_base = NULL;
>  		unsigned long delta_size, base_size = size;
>  		int i;
> 
>  		data = NULL;
> 
> -		if (base)
> -			add_delta_base_cache(p, obj_offset, base, base_size, type);
> +		if (base) {
> +			char *base_dup = memcpy(xmalloc(base_size), base, base_size);
> +			add_delta_base_cache(p, obj_offset, base_dup, base_size, type);
> +		}
> 
>  		if (!base) {
>  			/*
> @@ -1799,7 +1800,6 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
>  				      p->pack_name);
>  				mark_bad_packed_object(p, base_oid.hash);
>  				base = read_object(r, &base_oid, &type, &base_size);
> -				external_base = base;
>  			}
>  		}
> 
> @@ -1818,7 +1818,7 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
>  			      "at offset %"PRIuMAX" from %s",
>  			      (uintmax_t)curpos, p->pack_name);
>  			data = NULL;
> -			free(external_base);
> +			free(base);
>  			continue;
>  		}
> 
> @@ -1838,7 +1838,7 @@ void *unpack_entry(struct repository *r, struct packed_git *p, off_t obj_offset,
>  			error("failed to apply delta");
> 
>  		free(delta_data);
> -		free(external_base);
> +		free(base);
>  	}
> 
>  	if (final_type)
> --