Message ID | 1409766441-3886-1-git-send-email-przanoni@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Sep 03, 2014 at 02:47:21PM -0300, Paulo Zanoni wrote: > From: Paulo Zanoni <paulo.r.zanoni@intel.com> > > If we don't reset exit_handler_count before forking, we may have a > case where the forked process is killed before it even does > "exit_handler_count = 0": in that case, it is still finishing forking. > When that happens, we may end up calling our exit handlers. On the > specific bug I'm investigating, we call igt_reset_connnectors(), which > ends up in a deadlock inside malloc_atfork. If we attach gdb to the > forked process and get a backtrace, we have: > > (gdb) bt > 0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 > 1 0x00007f15634d36bf in _L_lock_10524 () from /lib/x86_64-linux-gnu/libc.so.6 > 2 0x00007f15634d12ef in malloc_atfork (sz=139729840351352, caller=<optimized out>) at arena.c:181 > 3 0x00007f15640466a1 in drmMalloc () from /usr/lib/x86_64-linux-gnu/libdrm.so.2 > 4 0x00007f1564049ad7 in drmModeGetResources () from /usr/lib/x86_64-linux-gnu/libdrm.so.2 > 5 0x0000000000408f84 in igt_reset_connectors () at igt_kms.c:1656 > 6 0x00000000004092dc in call_exit_handlers (sig=15) at igt_core.c:1130 > 7 fatal_sig_handler (sig=15) at igt_core.c:1154 > 8 <signal handler called> > 9 0x00007f15634cce60 in ptmalloc_unlock_all2 () at arena.c:298 > 10 0x00007f156350ca3f in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:188 > 11 0x000000000040a029 in __igt_fork_helper (proc=proc@entry=0x610fc4 <signal_helper>) at igt_core.c:910 > 12 0x000000000040459d in igt_fork_signal_helper () at igt_aux.c:110 > 13 0x0000000000402ab7 in __real_main63 () at bug.c:76 > 14 0x000000000040296e in main (argc=<optimized out>, argv=<optimized out>) at bug.c:63 > > After doing some searches for "stuck at malloc_atfork", it seems to me > we probably shouldn't be doing any malloc calls at this point of the > code, so the best way to do that is to make sure we can't really run > the exit handlers. > > So on this patch, instead of resetting the exit handlers after > forking, we reset them before forking, and then restore the original > value on the parent process. > > I can reproduce this problem by running "./kms_flip --run-subtest > 2x-flip-vs-modeset" under an infinite loop. Usually after a few > hundred calls, we end up stuck on the deadlock mentioned above. QA > says this problem happens every time, but I'm not sure what is the > difference between our environments that makes the race condition so > much easier for them. > > The kms_flip.c problem can be considered a regression introduced by: > commit eef768f283466b6d7cb3f08381f72ccf3951dc99 > Author: Thomas Wood <thomas.wood@intel.com> > Date: Wed Jun 18 14:28:43 2014 +0100 > tests: enable extra connectors in kms_flip and kms_pipe_crc_basic > > even though this commit is not the one that introduced the real > problem. > > It is also possible to reproduce this problem with a few modifications > to template.c: > - Add a call to igt_enable_connectors() inside the first fixture. > - Add igt_fork_signal_helper() and igt_stop_signal_helper() calls > around subtest B. > > Cc: Thomas Wood <thomas.wood@intel.com> > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=81367 > Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Awesome piece of debugging! But imo the commit message is missing the explanation why the forked child exists so quickly - the reason is that for skipped tests the parent kills it right away, before libc has completed the setup code in the child. We've had fun with this earlier, see commit 139c72f38a07c545f5a9ab5fa3750779987b9275 Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Date: Tue Dec 3 16:44:55 2013 +0000 drmtest: Avoid wrong PID/TID after clone races and the earlier attempt at commit 9298dfabd9658315df34616b1e9a10b3579a92bd Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Fri Sep 13 16:38:59 2013 +0200 lib: add test for igt_fork_signal_helper I've added a note and commment and merged the patch, thanks. -Daniel > --- > lib/igt_core.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/lib/igt_core.c b/lib/igt_core.c > index 7f14b17..e9a27e3 100644 > --- a/lib/igt_core.c > +++ b/lib/igt_core.c > @@ -895,6 +895,7 @@ bool __igt_fork_helper(struct igt_helper_process *proc) > { > pid_t pid; > int id; > + int tmp_count; > > assert(!proc->running); > assert(helper_process_count < ARRAY_SIZE(helper_process_pids)); > @@ -904,16 +905,19 @@ bool __igt_fork_helper(struct igt_helper_process *proc) > > igt_install_exit_handler(fork_helper_exit_handler); > > + tmp_count = exit_handler_count; > + exit_handler_count = 0; > switch (pid = fork()) { > case -1: > + exit_handler_count = tmp_count; > igt_assert(0); > case 0: > - exit_handler_count = 0; > reset_helper_process_list(); > oom_adjust_for_doom(); > > return true; > default: > + exit_handler_count = tmp_count; > proc->running = true; > proc->pid = pid; > proc->id = id; > -- > 2.1.0 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
On Thu, Sep 04, 2014 at 10:34:13AM +0200, Daniel Vetter wrote: > On Wed, Sep 03, 2014 at 02:47:21PM -0300, Paulo Zanoni wrote: > > From: Paulo Zanoni <paulo.r.zanoni@intel.com> > > > > If we don't reset exit_handler_count before forking, we may have a > > case where the forked process is killed before it even does > > "exit_handler_count = 0": in that case, it is still finishing forking. > > When that happens, we may end up calling our exit handlers. On the > > specific bug I'm investigating, we call igt_reset_connnectors(), which > > ends up in a deadlock inside malloc_atfork. If we attach gdb to the > > forked process and get a backtrace, we have: > > > > (gdb) bt > > 0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 > > 1 0x00007f15634d36bf in _L_lock_10524 () from /lib/x86_64-linux-gnu/libc.so.6 > > 2 0x00007f15634d12ef in malloc_atfork (sz=139729840351352, caller=<optimized out>) at arena.c:181 > > 3 0x00007f15640466a1 in drmMalloc () from /usr/lib/x86_64-linux-gnu/libdrm.so.2 > > 4 0x00007f1564049ad7 in drmModeGetResources () from /usr/lib/x86_64-linux-gnu/libdrm.so.2 > > 5 0x0000000000408f84 in igt_reset_connectors () at igt_kms.c:1656 > > 6 0x00000000004092dc in call_exit_handlers (sig=15) at igt_core.c:1130 > > 7 fatal_sig_handler (sig=15) at igt_core.c:1154 > > 8 <signal handler called> > > 9 0x00007f15634cce60 in ptmalloc_unlock_all2 () at arena.c:298 > > 10 0x00007f156350ca3f in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:188 > > 11 0x000000000040a029 in __igt_fork_helper (proc=proc@entry=0x610fc4 <signal_helper>) at igt_core.c:910 > > 12 0x000000000040459d in igt_fork_signal_helper () at igt_aux.c:110 > > 13 0x0000000000402ab7 in __real_main63 () at bug.c:76 > > 14 0x000000000040296e in main (argc=<optimized out>, argv=<optimized out>) at bug.c:63 > > > > After doing some searches for "stuck at malloc_atfork", it seems to me > > we probably shouldn't be doing any malloc calls at this point of the > > code, so the best way to do that is to make sure we can't really run > > the exit handlers. > > > > So on this patch, instead of resetting the exit handlers after > > forking, we reset them before forking, and then restore the original > > value on the parent process. > > > > I can reproduce this problem by running "./kms_flip --run-subtest > > 2x-flip-vs-modeset" under an infinite loop. Usually after a few > > hundred calls, we end up stuck on the deadlock mentioned above. QA > > says this problem happens every time, but I'm not sure what is the > > difference between our environments that makes the race condition so > > much easier for them. > > > > The kms_flip.c problem can be considered a regression introduced by: > > commit eef768f283466b6d7cb3f08381f72ccf3951dc99 > > Author: Thomas Wood <thomas.wood@intel.com> > > Date: Wed Jun 18 14:28:43 2014 +0100 > > tests: enable extra connectors in kms_flip and kms_pipe_crc_basic > > > > even though this commit is not the one that introduced the real > > problem. > > > > It is also possible to reproduce this problem with a few modifications > > to template.c: > > - Add a call to igt_enable_connectors() inside the first fixture. > > - Add igt_fork_signal_helper() and igt_stop_signal_helper() calls > > around subtest B. > > > > Cc: Thomas Wood <thomas.wood@intel.com> > > Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=81367 > > Signed-off-by: Paulo Zanoni <paulo.r.zanoni@intel.com> > > Awesome piece of debugging! > > But imo the commit message is missing the explanation why the forked child > exists so quickly - the reason is that for skipped tests the parent kills > it right away, before libc has completed the setup code in the child. > We've had fun with this earlier, see > > commit 139c72f38a07c545f5a9ab5fa3750779987b9275 > Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Date: Tue Dec 3 16:44:55 2013 +0000 > > drmtest: Avoid wrong PID/TID after clone races > > and the earlier attempt at > > commit 9298dfabd9658315df34616b1e9a10b3579a92bd > Author: Daniel Vetter <daniel.vetter@ffwll.ch> > Date: Fri Sep 13 16:38:59 2013 +0200 > > lib: add test for igt_fork_signal_helper I've meant commit a031a1bf93b828585e7147f06145fc5030814547 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Fri Sep 13 16:43:22 2013 +0200 lib/drmtest: ducttape over fork race Aside: For bugs in the test library itself I've started to create test-testcases in tests/igt_*.c. You can run them with a simple $ make check If you have an idea for a testcase then please add it, I myself didn't come up with anything. Note that the testcase can't rely at all on an i915 device being enabled, since this is meant to be run while building as a normal user on any kind of machine. So it's really just tests for the core library code. Thanks, Daniel > > I've added a note and commment and merged the patch, thanks. > -Daniel > > > > --- > > lib/igt_core.c | 6 +++++- > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/lib/igt_core.c b/lib/igt_core.c > > index 7f14b17..e9a27e3 100644 > > --- a/lib/igt_core.c > > +++ b/lib/igt_core.c > > @@ -895,6 +895,7 @@ bool __igt_fork_helper(struct igt_helper_process *proc) > > { > > pid_t pid; > > int id; > > + int tmp_count; > > > > assert(!proc->running); > > assert(helper_process_count < ARRAY_SIZE(helper_process_pids)); > > @@ -904,16 +905,19 @@ bool __igt_fork_helper(struct igt_helper_process *proc) > > > > igt_install_exit_handler(fork_helper_exit_handler); > > > > + tmp_count = exit_handler_count; > > + exit_handler_count = 0; > > switch (pid = fork()) { > > case -1: > > + exit_handler_count = tmp_count; > > igt_assert(0); > > case 0: > > - exit_handler_count = 0; > > reset_helper_process_list(); > > oom_adjust_for_doom(); > > > > return true; > > default: > > + exit_handler_count = tmp_count; > > proc->running = true; > > proc->pid = pid; > > proc->id = id; > > -- > > 2.1.0 > > > > _______________________________________________ > > Intel-gfx mailing list > > Intel-gfx@lists.freedesktop.org > > http://lists.freedesktop.org/mailman/listinfo/intel-gfx > > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Sep 04, 2014 at 10:39:31AM +0200, Daniel Vetter wrote: > On Thu, Sep 04, 2014 at 10:34:13AM +0200, Daniel Vetter wrote: > > On Wed, Sep 03, 2014 at 02:47:21PM -0300, Paulo Zanoni wrote: > > > From: Paulo Zanoni <paulo.r.zanoni@intel.com> > > > > > > If we don't reset exit_handler_count before forking, we may have a > > > case where the forked process is killed before it even does > > > "exit_handler_count = 0": in that case, it is still finishing forking. > > > When that happens, we may end up calling our exit handlers. On the > > > specific bug I'm investigating, we call igt_reset_connnectors(), which > > > ends up in a deadlock inside malloc_atfork. If we attach gdb to the > > > forked process and get a backtrace, we have: > > > > > > (gdb) bt > > > 0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 > > > 1 0x00007f15634d36bf in _L_lock_10524 () from /lib/x86_64-linux-gnu/libc.so.6 > > > 2 0x00007f15634d12ef in malloc_atfork (sz=139729840351352, caller=<optimized out>) at arena.c:181 > > > 3 0x00007f15640466a1 in drmMalloc () from /usr/lib/x86_64-linux-gnu/libdrm.so.2 > > > 4 0x00007f1564049ad7 in drmModeGetResources () from /usr/lib/x86_64-linux-gnu/libdrm.so.2 > > > 5 0x0000000000408f84 in igt_reset_connectors () at igt_kms.c:1656 > > > 6 0x00000000004092dc in call_exit_handlers (sig=15) at igt_core.c:1130 > > > 7 fatal_sig_handler (sig=15) at igt_core.c:1154 > > > 8 <signal handler called> > > > 9 0x00007f15634cce60 in ptmalloc_unlock_all2 () at arena.c:298 > > > 10 0x00007f156350ca3f in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:188 > > > 11 0x000000000040a029 in __igt_fork_helper (proc=proc@entry=0x610fc4 <signal_helper>) at igt_core.c:910 > > > 12 0x000000000040459d in igt_fork_signal_helper () at igt_aux.c:110 > > > 13 0x0000000000402ab7 in __real_main63 () at bug.c:76 > > > 14 0x000000000040296e in main (argc=<optimized out>, argv=<optimized out>) at bug.c:63 But note that this means that igt_reset_connectors is *not* signal safe and needs to be rewritten in order for it to work as an exit handler. Otherwise this bug will just keep on reoccurring. -Chris
2014-09-04 5:48 GMT-03:00 Chris Wilson <chris@chris-wilson.co.uk>: > On Thu, Sep 04, 2014 at 10:39:31AM +0200, Daniel Vetter wrote: >> On Thu, Sep 04, 2014 at 10:34:13AM +0200, Daniel Vetter wrote: >> > On Wed, Sep 03, 2014 at 02:47:21PM -0300, Paulo Zanoni wrote: >> > > From: Paulo Zanoni <paulo.r.zanoni@intel.com> >> > > >> > > If we don't reset exit_handler_count before forking, we may have a >> > > case where the forked process is killed before it even does >> > > "exit_handler_count = 0": in that case, it is still finishing forking. >> > > When that happens, we may end up calling our exit handlers. On the >> > > specific bug I'm investigating, we call igt_reset_connnectors(), which >> > > ends up in a deadlock inside malloc_atfork. If we attach gdb to the >> > > forked process and get a backtrace, we have: >> > > >> > > (gdb) bt >> > > 0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 >> > > 1 0x00007f15634d36bf in _L_lock_10524 () from /lib/x86_64-linux-gnu/libc.so.6 >> > > 2 0x00007f15634d12ef in malloc_atfork (sz=139729840351352, caller=<optimized out>) at arena.c:181 >> > > 3 0x00007f15640466a1 in drmMalloc () from /usr/lib/x86_64-linux-gnu/libdrm.so.2 >> > > 4 0x00007f1564049ad7 in drmModeGetResources () from /usr/lib/x86_64-linux-gnu/libdrm.so.2 >> > > 5 0x0000000000408f84 in igt_reset_connectors () at igt_kms.c:1656 >> > > 6 0x00000000004092dc in call_exit_handlers (sig=15) at igt_core.c:1130 >> > > 7 fatal_sig_handler (sig=15) at igt_core.c:1154 >> > > 8 <signal handler called> >> > > 9 0x00007f15634cce60 in ptmalloc_unlock_all2 () at arena.c:298 >> > > 10 0x00007f156350ca3f in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:188 >> > > 11 0x000000000040a029 in __igt_fork_helper (proc=proc@entry=0x610fc4 <signal_helper>) at igt_core.c:910 >> > > 12 0x000000000040459d in igt_fork_signal_helper () at igt_aux.c:110 >> > > 13 0x0000000000402ab7 in __real_main63 () at bug.c:76 >> > > 14 0x000000000040296e in main (argc=<optimized out>, argv=<optimized out>) at bug.c:63 > > But note that this means that igt_reset_connectors is *not* signal safe > and needs to be rewritten in order for it to work as an exit handler. > Otherwise this bug will just keep on reoccurring. Yeah, I studied the problem a little bit more and now I agree. I'm hoping the current patch will at least unblock QA's testing efforts, giving more time for us to write the proper solution. I'll keep watching the original bug report. Here is the bug report for the "real"fix: https://bugs.freedesktop.org/show_bug.cgi?id=83498 . > -Chris > > -- > Chris Wilson, Intel Open Source Technology Centre
diff --git a/lib/igt_core.c b/lib/igt_core.c index 7f14b17..e9a27e3 100644 --- a/lib/igt_core.c +++ b/lib/igt_core.c @@ -895,6 +895,7 @@ bool __igt_fork_helper(struct igt_helper_process *proc) { pid_t pid; int id; + int tmp_count; assert(!proc->running); assert(helper_process_count < ARRAY_SIZE(helper_process_pids)); @@ -904,16 +905,19 @@ bool __igt_fork_helper(struct igt_helper_process *proc) igt_install_exit_handler(fork_helper_exit_handler); + tmp_count = exit_handler_count; + exit_handler_count = 0; switch (pid = fork()) { case -1: + exit_handler_count = tmp_count; igt_assert(0); case 0: - exit_handler_count = 0; reset_helper_process_list(); oom_adjust_for_doom(); return true; default: + exit_handler_count = tmp_count; proc->running = true; proc->pid = pid; proc->id = id;