Message ID: 1645095472-26530-1-git-send-email-byungchul.park@lge.com
Series: DEPT(Dependency Tracker)
[ 7.009608] ===================================================
[ 7.009613] DEPT: Circular dependency has been detected.
[ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W
[ 7.009616] ---------------------------------------------------
[ 7.009617] summary
[ 7.009618] ---------------------------------------------------
[ 7.009618] *** DEADLOCK ***
[ 7.009618]
[ 7.009619] context A
[ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0)
[ 7.009621] [W] down_write(&ei->i_data_sem:0)
[ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0)
[ 7.009624]
[ 7.009625] context B
[ 7.009625] [S] down_read(&ei->i_data_sem:0)
[ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0)
[ 7.009627] [E] up_read(&ei->i_data_sem:0)
[ 7.009628]
[ 7.009629] [S]: start of the event context
[ 7.009629] [W]: the wait blocked
[ 7.009630] [E]: the event not reachable
[ 7.009631] ---------------------------------------------------
[ 7.009631] context A's detail
[ 7.009632] ---------------------------------------------------
[ 7.009632] context A
[ 7.009633] [S] (unknown)(&(bit_wait_table + i)->dmap:0)
[ 7.009634] [W] down_write(&ei->i_data_sem:0)
[ 7.009635] [E] event(&(bit_wait_table + i)->dmap:0)
[ 7.009636]
[ 7.009636] [S] (unknown)(&(bit_wait_table + i)->dmap:0):
[ 7.009638] (N/A)
[ 7.009638]
[ 7.009639] [W] down_write(&ei->i_data_sem:0):
[ 7.009639] ext4_truncate (fs/ext4/inode.c:4187)
[ 7.009645] stacktrace:
[ 7.009646] down_write (kernel/locking/rwsem.c:1514)
[ 7.009648] ext4_truncate (fs/ext4/inode.c:4187)
[ 7.009650] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
[ 7.009652] generic_perform_write (mm/filemap.c:3784)
[ 7.009654] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 7.009657] ext4_file_write_iter (fs/ext4/file.c:677)
[ 7.009659] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 7.009662] vfs_write (fs/read_write.c:590)
[ 7.009663] ksys_write (fs/read_write.c:644)
[ 7.009664] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 7.009667] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 7.009669]
[ 7.009670] [E] event(&(bit_wait_table + i)->dmap:0):
[ 7.009671] __wake_up_common (kernel/sched/wait.c:108)
[ 7.009673] stacktrace:
[ 7.009674] dept_event (kernel/dependency/dept.c:2337)
[ 7.009677] __wake_up_common (kernel/sched/wait.c:109)
[ 7.009678] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 7.009679] __wake_up_bit (kernel/sched/wait_bit.c:127)
[ 7.009681] ext4_orphan_del (fs/ext4/orphan.c:282)
[ 7.009683] ext4_truncate (fs/ext4/inode.c:4212)
[ 7.009685] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
[ 7.009687] generic_perform_write (mm/filemap.c:3784)
[ 7.009688] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 7.009690] ext4_file_write_iter (fs/ext4/file.c:677)
[ 7.009692] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 7.009694] vfs_write (fs/read_write.c:590)
[ 7.009695] ksys_write (fs/read_write.c:644)
[ 7.009696] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 7.009698] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 7.009700] ---------------------------------------------------
[ 7.009700] context B's detail
[ 7.009701] ---------------------------------------------------
[ 7.009702] context B
[ 7.009702] [S] down_read(&ei->i_data_sem:0)
[ 7.009703] [W] wait(&(bit_wait_table + i)->dmap:0)
[ 7.009704] [E] up_read(&ei->i_data_sem:0)
[ 7.009705]
[ 7.009706] [S] down_read(&ei->i_data_sem:0):
[ 7.009707] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 fs/ext4/ext4.h:1918 fs/ext4/inode.c:562)
[ 7.009709] stacktrace:
[ 7.009709] down_read (kernel/locking/rwsem.c:1461)
[ 7.009711] ext4_map_blocks (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 fs/ext4/ext4.h:1918 fs/ext4/inode.c:562)
[ 7.009712] ext4_getblk (fs/ext4/inode.c:851)
[ 7.009714] ext4_bread (fs/ext4/inode.c:903)
[ 7.009715] __ext4_read_dirblock (fs/ext4/namei.c:117)
[ 7.009718] dx_probe (fs/ext4/namei.c:789)
[ 7.009720] ext4_dx_find_entry (fs/ext4/namei.c:1721)
[ 7.009722] __ext4_find_entry (fs/ext4/namei.c:1571)
[ 7.009723] ext4_lookup (fs/ext4/namei.c:1770)
[ 7.009725] lookup_open (./include/linux/dcache.h:361 fs/namei.c:3310)
[ 7.009727] path_openat (fs/namei.c:3401 fs/namei.c:3605)
[ 7.009729] do_filp_open (fs/namei.c:3637)
[ 7.009731] do_sys_openat2 (fs/open.c:1215)
[ 7.009732] do_sys_open (fs/open.c:1231)
[ 7.009734] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 7.009736] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 7.009738]
[ 7.009738] [W] wait(&(bit_wait_table + i)->dmap:0):
[ 7.009739] prepare_to_wait (kernel/sched/wait.c:275)
[ 7.009741] stacktrace:
[ 7.009741] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
[ 7.009743] schedule (kernel/sched/core.c:6373 (discriminator 1))
[ 7.009744] io_schedule (./arch/x86/include/asm/current.h:15 kernel/sched/core.c:8392 kernel/sched/core.c:8418)
[ 7.009745] bit_wait_io (./arch/x86/include/asm/current.h:15 kernel/sched/wait_bit.c:210)
[ 7.009746] __wait_on_bit (kernel/sched/wait_bit.c:49)
[ 7.009748] out_of_line_wait_on_bit (kernel/sched/wait_bit.c:65)
[ 7.009749] ext4_read_bh (./arch/x86/include/asm/bitops.h:207 ./include/asm-generic/bitops/instrumented-non-atomic.h:135 ./include/linux/buffer_head.h:120 fs/ext4/super.c:201)
[ 7.009752] __read_extent_tree_block (fs/ext4/extents.c:545)
[ 7.009754] ext4_find_extent (fs/ext4/extents.c:928)
[ 7.009756] ext4_ext_map_blocks (fs/ext4/extents.c:4099)
[ 7.009757] ext4_map_blocks (fs/ext4/inode.c:563)
[ 7.009759] ext4_getblk (fs/ext4/inode.c:851)
[ 7.009760] ext4_bread (fs/ext4/inode.c:903)
[ 7.009762] __ext4_read_dirblock (fs/ext4/namei.c:117)
[ 7.009764] dx_probe (fs/ext4/namei.c:789)
[ 7.009765] ext4_dx_find_entry (fs/ext4/namei.c:1721)
[ 7.009767]
[ 7.009768] [E] up_read(&ei->i_data_sem:0):
[ 7.009769] ext4_map_blocks (fs/ext4/inode.c:593)
[ 7.009771] stacktrace:
[ 7.009771] up_read (kernel/locking/rwsem.c:1556)
[ 7.009774] ext4_map_blocks (fs/ext4/inode.c:593)
[ 7.009775] ext4_getblk (fs/ext4/inode.c:851)
[ 7.009777] ext4_bread (fs/ext4/inode.c:903)
[ 7.009778] __ext4_read_dirblock (fs/ext4/namei.c:117)
[ 7.009780] dx_probe (fs/ext4/namei.c:789)
[ 7.009782] ext4_dx_find_entry (fs/ext4/namei.c:1721)
[ 7.009784] __ext4_find_entry (fs/ext4/namei.c:1571)
[ 7.009786] ext4_lookup (fs/ext4/namei.c:1770)
[ 7.009788] lookup_open (./include/linux/dcache.h:361 fs/namei.c:3310)
[ 7.009789] path_openat (fs/namei.c:3401 fs/namei.c:3605)
[ 7.009791] do_filp_open (fs/namei.c:3637)
[ 7.009792] do_sys_openat2 (fs/open.c:1215)
[ 7.009794] do_sys_open (fs/open.c:1231)
[ 7.009795] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 7.009797] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 7.009799] ---------------------------------------------------
[ 7.009800] information that might be helpful
[ 7.009800] ---------------------------------------------------
[ 7.009801] CPU: 0 PID: 611 Comm: rs:main Q:Reg Tainted: G W 5.17.0-rc1-00014-g8a599299c0cb-dirty #30
[ 7.009804] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 7.009805] Call Trace:
[ 7.009806] <TASK>
[ 7.009807] dump_stack_lvl (lib/dump_stack.c:107)
[ 7.009809] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762)
[ 7.009812] ? print_circle (kernel/dependency/dept.c:1086)
[ 7.009814] cb_check_dl (kernel/dependency/dept.c:1104)
[ 7.009815] bfs (kernel/dependency/dept.c:860)
[ 7.009818] add_dep (kernel/dependency/dept.c:1423)
[ 7.009820] do_event.isra.25 (kernel/dependency/dept.c:1650)
[ 7.009822] ? __wake_up_common (kernel/sched/wait.c:108)
[ 7.009824] dept_event (kernel/dependency/dept.c:2337)
[ 7.009826] __wake_up_common (kernel/sched/wait.c:109)
[ 7.009828] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 7.009830] __wake_up_bit (kernel/sched/wait_bit.c:127)
[ 7.009832] ext4_orphan_del (fs/ext4/orphan.c:282)
[ 7.009835] ? dept_ecxt_exit (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:2478)
[ 7.009837] ext4_truncate (fs/ext4/inode.c:4212)
[ 7.009839] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
[ 7.009842] generic_perform_write (mm/filemap.c:3784)
[ 7.009845] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 7.009848] ext4_file_write_iter (fs/ext4/file.c:677)
[ 7.009851] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 7.009854] vfs_write (fs/read_write.c:590)
[ 7.009856] ksys_write (fs/read_write.c:644)
[ 7.009857] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:65)
[ 7.009860] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 7.009862] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 7.009865] RIP: 0033:0x7f3b160b335d
[ 7.009867] Code: e1 20 00 00 75 10 b8 01 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ce fa ff ff 48 89 04 24 b8 01 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 17 fb ff ff 48 89 d0 48 83 c4 08 48 3d 01
All code
========
   0:  e1 20                   loope  0x22
   2:  00 00                   add    %al,(%rax)
   4:  75 10                   jne    0x16
   6:  b8 01 00 00 00          mov    $0x1,%eax
   b:  0f 05                   syscall
   d:  48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
  13:  73 31                   jae    0x46
  15:  c3                      retq
  16:  48 83 ec 08             sub    $0x8,%rsp
  1a:  e8 ce fa ff ff          callq  0xfffffffffffffaed
  1f:  48 89 04 24             mov    %rax,(%rsp)
  23:  b8 01 00 00 00          mov    $0x1,%eax
  28:  0f 05                   syscall
  2a:* 48 8b 3c 24             mov    (%rsp),%rdi    <-- trapping instruction
  2e:  48 89 c2                mov    %rax,%rdx
  31:  e8 17 fb ff ff          callq  0xfffffffffffffb4d
  36:  48 89 d0                mov    %rdx,%rax
  39:  48 83 c4 08             add    $0x8,%rsp
  3d:  48                      rex.W
  3e:  3d                      .byte 0x3d
  3f:  01                      .byte 0x1

Code starting with the faulting instruction
===========================================
   0:  48 8b 3c 24             mov    (%rsp),%rdi
   4:  48 89 c2                mov    %rax,%rdx
   7:  e8 17 fb ff ff          callq  0xfffffffffffffb23
   c:  48 89 d0                mov    %rdx,%rax
   f:  48 83 c4 08             add    $0x8,%rsp
  13:  48                      rex.W
  14:  3d                      .byte 0x3d
  15:  01                      .byte 0x1
[ 7.009869] RSP: 002b:00007f3b1340f180 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 7.009871] RAX: ffffffffffffffda RBX: 00007f3b040010a0 RCX: 00007f3b160b335d
[ 7.009873] RDX: 0000000000000300 RSI: 00007f3b040010a0 RDI: 0000000000000001
[ 7.009874] RBP: 0000000000000000 R08: fffffffffffffa15 R09: fffffffffffffa05
[ 7.009875] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f3b04000df0
[ 7.009876] R13: 00007f3b1340f1a0 R14: 0000000000000220 R15: 0000000000000300
[ 7.009879] </TASK>
[ 9.008161] ===================================================
[ 9.008163] DEPT: Circular dependency has been detected.
[ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
[ 9.008166] ---------------------------------------------------
[ 9.008167] summary
[ 9.008167] ---------------------------------------------------
[ 9.008168] *** DEADLOCK ***
[ 9.008168]
[ 9.008168] context A
[ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
[ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008173]
[ 9.008173] context B
[ 9.008174] [S] down_write(mapping.invalidate_lock:0)
[ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008176] [E] up_write(mapping.invalidate_lock:0)
[ 9.008177]
[ 9.008178] context C
[ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
[ 9.008180] [W] down_write(mapping.invalidate_lock:0)
[ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
[ 9.008181]
[ 9.008182] [S]: start of the event context
[ 9.008183] [W]: the wait blocked
[ 9.008183] [E]: the event not reachable
[ 9.008184] ---------------------------------------------------
[ 9.008184] context A's detail
[ 9.008185] ---------------------------------------------------
[ 9.008186] context A
[ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0)
[ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008189]
[ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0):
[ 9.008191] (N/A)
[ 9.008191]
[ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0):
[ 9.008193] prepare_to_wait (kernel/sched/wait.c:275)
[ 9.008197] stacktrace:
[ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
[ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1))
[ 9.008201] kjournald2 (fs/jbd2/journal.c:250)
[ 9.008203] kthread (kernel/kthread.c:377)
[ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008209]
[ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0):
[ 9.008210] __wake_up_common (kernel/sched/wait.c:108)
[ 9.008212] stacktrace:
[ 9.008213] dept_event (kernel/dependency/dept.c:2337)
[ 9.008215] __wake_up_common (kernel/sched/wait.c:109)
[ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
[ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
[ 9.008223] kthread (kernel/kthread.c:377)
[ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008226] ---------------------------------------------------
[ 9.008226] context B's detail
[ 9.008227] ---------------------------------------------------
[ 9.008228] context B
[ 9.008228] [S] down_write(mapping.invalidate_lock:0)
[ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008230] [E] up_write(mapping.invalidate_lock:0)
[ 9.008231]
[ 9.008232] [S] down_write(mapping.invalidate_lock:0):
[ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008237] stacktrace:
[ 9.008237] down_write (kernel/locking/rwsem.c:1514)
[ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008241] generic_perform_write (mm/filemap.c:3784)
[ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008250] vfs_write (fs/read_write.c:590)
[ 9.008251] ksys_write (fs/read_write.c:644)
[ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 9.008258]
[ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0):
[ 9.008259] prepare_to_wait (kernel/sched/wait.c:275)
[ 9.008261] stacktrace:
[ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
[ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1))
[ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184)
[ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3))
[ 9.008267] start_this_handle (fs/jbd2/transaction.c:427)
[ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526)
[ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105)
[ 9.008273] ext4_truncate (fs/ext4/inode.c:4164)
[ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
[ 9.008276] generic_perform_write (mm/filemap.c:3784)
[ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008283] vfs_write (fs/read_write.c:590)
[ 9.008284] ksys_write (fs/read_write.c:644)
[ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008287]
[ 9.008288] [E] up_write(mapping.invalidate_lock:0):
[ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829)
[ 9.008291] ---------------------------------------------------
[ 9.008291] context C's detail
[ 9.008292] ---------------------------------------------------
[ 9.008292] context C
[ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
[ 9.008294] [W] down_write(mapping.invalidate_lock:0)
[ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0)
[ 9.008296]
[ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0):
[ 9.008298] (N/A)
[ 9.008298]
[ 9.008299] [W] down_write(mapping.invalidate_lock:0):
[ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008302] stacktrace:
[ 9.008302] down_write (kernel/locking/rwsem.c:1514)
[ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008305] generic_perform_write (mm/filemap.c:3784)
[ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008312] vfs_write (fs/read_write.c:590)
[ 9.008314] ksys_write (fs/read_write.c:644)
[ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 9.008318]
[ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0):
[ 9.008320] __wake_up_common (kernel/sched/wait.c:108)
[ 9.008321] stacktrace:
[ 9.008322] __wake_up_common (kernel/sched/wait.c:109)
[ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508)
[ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527)
[ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560)
[ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583)
[ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3))
[ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1))
[ 9.008334] generic_perform_write (mm/filemap.c:3784)
[ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008341] vfs_write (fs/read_write.c:590)
[ 9.008342] ksys_write (fs/read_write.c:644)
[ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 9.008347] ---------------------------------------------------
[ 9.008348] information that might be helpful
[ 9.008348] ---------------------------------------------------
[ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2
[ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 9.008353] Call Trace:
[ 9.008354] <TASK>
[ 9.008355] dump_stack_lvl (lib/dump_stack.c:107)
[ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762)
[ 9.008360] ? print_circle (kernel/dependency/dept.c:1086)
[ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104)
[ 9.008364] bfs (kernel/dependency/dept.c:860)
[ 9.008366] add_dep (kernel/dependency/dept.c:1423)
[ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651)
[ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108)
[ 9.008372] dept_event (kernel/dependency/dept.c:2337)
[ 9.008374] __wake_up_common (kernel/sched/wait.c:109)
[ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
[ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24)
[ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843)
[ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
[ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
[ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239)
[ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
[ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431)
[ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173)
[ 9.008402] kthread (kernel/kthread.c:377)
[ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332)
[ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008410] </TASK>
On Thu, Feb 17, 2022 at 08:10:03PM +0900, Byungchul Park wrote:
> [ 7.009608] ===================================================
> [ 7.009613] DEPT: Circular dependency has been detected.
> [ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W
> [ 7.009616] ---------------------------------------------------
> [ 7.009617] summary
> [ 7.009618] ---------------------------------------------------
> [ 7.009618] *** DEADLOCK ***
> [ 7.009618]
> [ 7.009619] context A
> [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0)

Why is the context unknown here?  I don't see a way to debug this
without knowing where we acquired the bit wait lock.

> [ 7.009621] [W] down_write(&ei->i_data_sem:0)
> [ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0)
> [ 7.009624]
> [ 7.009625] context B
> [ 7.009625] [S] down_read(&ei->i_data_sem:0)
> [ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0)
> [ 7.009627] [E] up_read(&ei->i_data_sem:0)
>
> [ ... snip the rest of the report, quoted unchanged from the full dump above ... ]
On Thu, Feb 17, 2022 at 07:57:36PM +0900, Byungchul Park wrote:
>
> I've got several reports from the tool. Some of them look like false
> alarms and some others look like real deadlock possibility. Because of
> my unfamiliarity of the domain, it's hard to confirm if it's a real one.
> Let me add the reports on this email thread.

The problem is we have so many reports that are potentially invalid, or
so-rare-as-to-be-not-worth-the-time-to-investigate-in-the-grand-scheme-
of-all-of-the-fires-burning-on-maintainers'-laps, that it's really not
reasonable to ask maintainers to determine whether something is a false
alarm or not. If I want more of these unreliable potential bug reports
to investigate, there is a huge backlog in Syzkaller. :-)

Looking at the second ext4 report, it doesn't make any sense. Context
A is the kjournald thread. We don't do a commit until (a) the timeout
expires, or (b) someone explicitly requests that a commit happen,
waking up j_wait_commit. I'm guessing the complaint here is that DEPT
thinks nothing is explicitly requesting a wake up. But note that after
5 seconds (or whatever journal->j_commit_interval is configured to be)
we *will* always start a commit. So ergo, there can't be a deadlock.

At a higher level of discussion, it's an unfair tax on maintainers'
time to ask them to help you debug DEPT for you. Tools like Syzkaller
and DEPT are useful insofar as they save us time in making our
subsystems better. But until you can prove that it's not going to be a
massive denial-of-service attack on maintainers' time, at the *very*
least keep an RFC on the patch, or add massive warnings that more
often than not DEPT is going to be sending maintainers on a wild goose
chase. If you know that there "appear to be false positives", you need
to make sure you've tracked them all down before asking for this to be
merged.

You may also want to add some documentation about why we should trust
this; in particular for wait channels, when a process calls schedule()
there may be multiple reasons why the thread will wake up --- in the
worst case, such as in the select(2) or epoll(2) system call, there
may be literally thousands of reasons (one for every file descriptor
the select is waiting on) why the process will wake up and thus
resolve the potential "deadlock" that DEPT is worrying about. How is
DEPT going to handle those cases? If the answer is that things need to
be tagged, then at least disclose the potential reasons why DEPT might
be untrustworthy, to save your reviewers' time.

I know that you're trying to help us, but this tool needs to be far
better than Lockdep before we should think about merging it. Even if
it finds 5% more potential deadlocks, if it creates 95% more false
positive reports --- and the ones it finds are crazy things that
rarely actually happen in practice --- are the costs worth the
benefits? And who is bearing the costs, and who is receiving the
benefits?

Regards,

- Ted
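For reference, the pattern Ted describes looks roughly like the sketch
below. This is a simplified, from-memory illustration rather than the
actual kjournald2() code in fs/jbd2/journal.c: the journal_t fields
(j_commit_timer, j_commit_interval, j_commit_sequence, j_commit_request,
j_wait_commit) are real jbd2 ones, but commit_thread() is invented here.
The point it illustrates is why the sleep on j_wait_commit is always
bounded by a timer, so the [W] in DEPT's context A cannot stall forever.

#include <linux/jbd2.h>
#include <linux/jiffies.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/wait.h>

/* Invented for illustration; the real loop is kjournald2(). */
static int commit_thread(void *arg)
{
	journal_t *journal = arg;

	while (!kthread_should_stop()) {
		DEFINE_WAIT(wait);

		/*
		 * Re-arm the periodic timer; its handler (see
		 * commit_timeout() in fs/jbd2/journal.c) wakes this
		 * thread when the commit interval expires.
		 */
		mod_timer(&journal->j_commit_timer,
			  jiffies + journal->j_commit_interval);

		if (journal->j_commit_sequence != journal->j_commit_request) {
			/*
			 * Someone requested a commit: run it.  This is
			 * where waiters on j_wait_transaction_locked
			 * eventually get woken.
			 */
			jbd2_journal_commit_transaction(journal);
			continue;
		}

		/*
		 * The wait DEPT flags as [W] in context A.  Even if no
		 * explicit waker ever shows up, the timer armed above
		 * fires within j_commit_interval jiffies and wakes the
		 * thread, so the wait is always bounded.
		 */
		prepare_to_wait(&journal->j_wait_commit, &wait,
				TASK_INTERRUPTIBLE);
		if (journal->j_commit_sequence == journal->j_commit_request)
			schedule();
		finish_wait(&journal->j_wait_commit, &wait);
	}
	return 0;
}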
On Thu, 17 Feb 2022 10:51:09 -0500
"Theodore Ts'o" <tytso@mit.edu> wrote:

> I know that you're trying to help us, but this tool needs to be far
> better than Lockdep before we should think about merging it. Even if
> it finds 5% more potential deadlocks, if it creates 95% more false
> positive reports --- and the ones it finds are crazy things that
> rarely actually happen in practice --- are the costs worth the
> benefits? And who is bearing the costs, and who is receiving the
> benefits?

I personally believe that there's potential for this to be helpful and
that we will want to merge it.

But what I believe Ted is trying to say is: if you do not know whether
a report is a bug or not, please do not ask the maintainers to
determine it for you. This is a good opportunity for you to look at
why your tool reported an issue, and to learn that subsystem. Look at
whether it is really a bug or not, and investigate why.

The likely/unlikely tracing I do finds issues all over the kernel. But
before I report anything, I look at the subsystem and determine *why*
it's reporting what it does. In some cases it's just a config issue,
where I may submit a patch saying "this is 100% wrong in X config, and
we should just remove the unlikely". But I did the due diligence to
find out exactly what the issue is, and why the tooling reported what
it reported.

I want to stress that your Dept tooling looks to have the potential of
being something worth including. But the false positives need to be
down to the rate of lockdep's false positives. As Ted said, if it's
reporting 95% false positives, nobody is going to look at the 5% of
real bugs that it finds.

-- Steve
On Thu, Feb 17, 2022 at 12:00:05PM -0500, Steven Rostedt wrote:
> On Thu, 17 Feb 2022 10:51:09 -0500
> "Theodore Ts'o" <tytso@mit.edu> wrote:
>
> > I know that you're trying to help us, but this tool needs to be far
> > better than Lockdep before we should think about merging it. Even if
> > it finds 5% more potential deadlocks, if it creates 95% more false
> > positive reports --- and the ones it finds are crazy things that
> > rarely actually happen in practice --- are the costs worth the
> > benefits? And who is bearing the costs, and who is receiving the
> > benefits?
>
> I personally believe that there's potential for this to be helpful and
> that we will want to merge it.
>
> But what I believe Ted is trying to say is: if you do not know whether
> a report is a bug or not, please do not ask the maintainers to
> determine it for you. This is a good opportunity for you to look at
> why your tool reported an issue, and to learn that subsystem. Look at
> whether it is really a bug or not, and investigate why.

I agree with Steven here, to the point where I'm willing to invest some
time being a beta-tester for this, so if you focus your efforts on
filesystem/mm kinds of problems, I can continue looking at them and
tell you what's helpful and what's unhelpful in the reports.
On Thu, Feb 17, 2022 at 01:27:42PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 17, 2022 at 08:10:03PM +0900, Byungchul Park wrote:
> > [ ... ]
> > [ 7.009619] context A
> > [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0)
>
> Why is the context unknown here?  I don't see a way to debug this
> without knowing where we acquired the bit wait lock.

   ideal view
   ------------------
   context X                          context A

   request event E to context A
   ...
   write REQUESTEVENT
                                      if (can see REQUESTEVENT written)
                                         notice the request from X [S]
   wait for the event [W]
   ...
   write barrier
   write WAITSTART
   actual wait
                                      if (can see REQUESTEVENT written)
                                         consider it's on the way to the event
                                      ...
                                      finally the event [E]

Dept works with the above view wrt. wait and event. The [S] point varies
depending on the way context A is made to notice the request, so that
the event context A can start. Of course, by making more effort to
identify each such way in the kernel, we could figure out the exact
point. Here, we can instead conservatively check WAITSTART - with a read
barrier - rather than REQUESTEVENT in context A. That's how Dept works:

   Dept's view
   ------------------
   context X                          context A

   request event E to context A
   ...
   write REQUESTEVENT
                                      if (can see REQUESTEVENT written)
                                         notice the request from X [S]
   wait for the event [W]
   ...
   write barrier
   write WAITSTART
                                      read barrier
   actual wait
                                      if (can see WAITSTART written)
                                         consider it's on the way to the event
                                      ...
                                      finally the event [E]
                                         consider all waits in context A so
                                         far that could see WAITSTART written

Thanks,
Byungchul

> > [ ... snip the rest of the quoted report ... ]
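The handshake in the diagrams above can be rendered roughly as the
following C sketch. The flag names and the two functions are invented
for illustration and are not DEPT's API; only READ_ONCE()/WRITE_ONCE()
and the barrier primitives are real kernel interfaces.

#include <linux/compiler.h>
#include <asm/barrier.h>

/* Shared flags, invented for this sketch only. */
static int requestevent;	/* X has requested event E from A */
static int waitstart;		/* X has started to wait for E    */

/* Context X: request the event, then wait for it. */
static void context_x(void)
{
	WRITE_ONCE(requestevent, 1);	/* request event E to context A */
	/* ... */
	smp_wmb();			/* the write barrier in the diagram */
	WRITE_ONCE(waitstart, 1);
	/* ... the actual wait for event E, the [W] above ... */
}

/* Context A: the event side. */
static void context_a(void)
{
	if (READ_ONCE(requestevent)) {
		/* The ideal [S]: A notices X's request here. */
	}

	smp_rmb();			/* the read barrier in the diagram */
	if (READ_ONCE(waitstart)) {
		/*
		 * Dept's conservative [S]: from here until the event
		 * [E] fires, every wait this context performs is
		 * treated as part of the event context ("all waits in
		 * context A so far that could see WAITSTART written").
		 */
	}

	/* ... finally deliver the event [E], e.g. via wake_up() ... */
}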
On Thu, Feb 17, 2022 at 12:00:05PM -0500, Steven Rostedt wrote:
>
> I personally believe that there's potential for this to be helpful and
> that we will want to merge it.
>
> But what I believe Ted is trying to say is: if you do not know whether
> a report is a bug or not, please do not ask the maintainers to
> determine it for you. This is a good opportunity for you to look at
> why your tool reported an issue, and to learn that subsystem. Look at
> whether it is really a bug or not, and investigate why.

I agree there's potential here, or I would have ignored the ext4 "bug
report". When we can get rid of the false positives, I think it should
be merged; I'd just rather it not be merged until after the false
positives are fixed, since otherwise someone well-meaning will start
using it with Syzkaller, and the noise that maintainers need to deal
with (people requesting reverts of two-year-old commits, etc.) will
increase by a factor of ten or more. (With Syzbot reproducers that set
up random cgroups, IP tunnels with wireguard enabled, FUSE stress
testers, etc., that file system maintainers will be asked to try to
disentangle.)

So from a maintainer's perspective, false positives are highly
negative. It may be that from some people's POV, one bug found and 20
false positives might still be "useful". But if your tool gains a
reputation for not valuing maintainers' time, it's just going to make
us (or at least me :-) cranky, and it's going to be very hard to
recover from that perception. So it's probably better to be very
conservative and careful in polishing it before asking for it to be
merged.

Cheers,

- Ted
On Thu, Feb 17, 2022 at 10:51:09AM -0500, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 07:57:36PM +0900, Byungchul Park wrote:
> >
> > I've got several reports from the tool. Some of them look like false
> > alarms and some others look like real deadlock possibility. Because of
> > my unfamiliarity of the domain, it's hard to confirm if it's a real one.
> > Let me add the reports on this email thread.
>
> The problem is we have so many reports that are potentially invalid, or
> so-rare-as-to-be-not-worth-the-time-to-investigate-in-the-grand-scheme-
> of-all-of-the-fires-burning-on-maintainers'-laps, that it's really not
> reasonable to ask maintainers to determine whether

Even though I might have been wrong then, and might be wrong again, you
sound so arrogant. You were hasty to judge and tried to walk over me. I
reported it because I thought it was a real problem but couldn't
confirm it myself. The other reports, the ones I thought were not real,
I didn't even mention. If you are talking about the previous report,
then I already said I was sorry about it; I had only skimmed through
the part describing the waits...

Basically, I respect you and appreciate your feedback. I hope you don't
get me wrong.

> Looking at the second ext4 report, it doesn't make any sense. Context
> A is the kjournald thread. We don't do a commit until (a) the timeout
> expires, or (b) someone explicitly requests that a commit happen,
> waking up j_wait_commit. I'm guessing the complaint here is that DEPT
> thinks nothing is explicitly requesting a wake up. But note that after
> 5 seconds (or whatever journal->j_commit_interval is configured to be)
> we *will* always start a commit. So ergo, there can't be a deadlock.

Yeah, it might not be a *deadlock deadlock*, because the wait will
anyway be woken up by one of the wake-up points you mentioned. However,
the dependency still looks problematic, because the three contexts
participating in the dependency chain would be stuck for a while until
one of them eventually gets woken up. I bet that is not the behavior
you intended. Again: it's not critical, but it is problematic. Or am I
missing something?

> At a higher level of discussion, it's an unfair tax on maintainers'
> time to ask them to help you debug DEPT for you. Tools like Syzkaller
> and DEPT are useful insofar as they save us time in making our
> subsystems better. But until you can prove that it's not going to be a
> massive denial-of-service attack on maintainers' time, at the

Partially, I agree. I would understand even if you don't support Dept
until you think it's valuable enough. However, let me keep asking
things of the fs folks, not you, even though I would cc you.

> If you know that there "appear to be false positives", you need
> to make sure you've tracked them all down before asking for this to be
> merged.

To track them all down, I need to ask LKML, because Dept runs cleanly
on my own system. I don't want it to be merged with a lot of false
positives still in there, either.

> You may also want to add some documentation about why we should trust
> this; in particular for wait channels, when a process calls schedule()
> there may be multiple reasons why the thread will wake up --- in the
> worst case, such as in the select(2) or epoll(2) system call, there
> may be literally thousands of reasons (one for every file descriptor
> the select is waiting on) why the process will wake up and thus
> resolve the potential "deadlock" that DEPT is worrying about. How is
> DEPT going to handle those cases? If the answer is that things need to
> be tagged, then at least disclose the potential reasons why DEPT might
> be untrustworthy, to save your reviewers' time.

Thank you for the information, but I don't get which case you are
concerned about. I'd like to ask you for a specific scenario so that we
can discuss it further - maybe I could answer it - but I won't. Just
give me an instance, only if you think it's worth it. You look like a
guy who unconditionally blames new things before understanding them,
rather than asking and discussing. Again, I also think no one has to
spend his or her time on what he or she thinks is not worthy enough.

> I know that you're trying to help us, but this tool needs to be far
> better than Lockdep before we should think about merging it. Even if
> it finds 5% more potential deadlocks, if it creates 95% more false

If so, it surely should not get merged, but that sounds too sarcastic.
Let's see if it really creates 95% false positives. If that's true and
I can't fix it, I will give up. That's what I should do.

There are a lot of factors in judging how valuable Dept is. Dept would
be useful especially in the middle of development, rather than once
code reaches its final state in the tree. I'd appreciate it if you
considered that side more, too.

Thanks,
Byungchul
On Thu, Feb 17, 2022 at 05:06:18PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 17, 2022 at 12:00:05PM -0500, Steven Rostedt wrote:
> > On Thu, 17 Feb 2022 10:51:09 -0500
> > "Theodore Ts'o" <tytso@mit.edu> wrote:
> >
> > > I know that you're trying to help us, but this tool needs to be far
> > > better than Lockdep before we should think about merging it. Even if
> > > it finds 5% more potential deadlocks, if it creates 95% more false
> > > positive reports --- and the ones it finds are crazy things that
> > > rarely actually happen in practice, are the costs worth the benefits?
> > > And who is bearing the costs, and who is receiving the benefits?
> >
> > I personally believe that there's potential that this can be helpful and we
> > will want to merge it.
> >
> > But, what I believe Ted is trying to say is, if you do not know if the
> > report is a bug or not, please do not ask the maintainers to determine it
> > for you. This is a good opportunity for you to look to see why your tool
> > reported an issue, and learn that subsystem. Look at if this is really a
> > bug or not, and investigate why.
>
> I agree with Steven here, to the point where I'm willing to invest some
> time being a beta-tester for this, so if you focus your efforts on
> filesystem/mm kinds of problems, I can continue looking at them and
> tell you what's helpful and what's unhelpful in the reports.

I appreciate your support. I'll do my best to make it *THE* perfect tool
for that purpose. I'd be glad if it ends up helping a lot and saving you
guys' time, even if it might not do so just yet.

Thanks,
Byungchul
On Thu, Feb 17, 2022 at 12:00:05PM -0500, Steven Rostedt wrote:
> On Thu, 17 Feb 2022 10:51:09 -0500
> "Theodore Ts'o" <tytso@mit.edu> wrote:
>
> > I know that you're trying to help us, but this tool needs to be far
> > better than Lockdep before we should think about merging it. Even if
> > it finds 5% more potential deadlocks, if it creates 95% more false
> > positive reports --- and the ones it finds are crazy things that
> > rarely actually happen in practice, are the costs worth the benefits?
> > And who is bearing the costs, and who is receiving the benefits?
>
> I personally believe that there's potential that this can be helpful and we
> will want to merge it.
>
> But, what I believe Ted is trying to say is, if you do not know if the
> report is a bug or not, please do not ask the maintainers to determine it
> for you. This is a good opportunity for you to look to see why your tool
> reported an issue, and learn that subsystem. Look at if this is really a
> bug or not, and investigate why.

I appreciate your feedback. I'll be more careful in reporting things,
and I think I need to make the tool more conservative...

> The likely/unlikely tracing I do finds issues all over the kernel. But
> before I report anything, I look at the subsystem and determine *why* it's
> reporting what it does. In some cases, it's just a config issue. There, I
> may submit a patch saying "this is 100% wrong in X config, and we should
> just remove the "unlikely". But I did the due diligence to find out exactly
> what the issue is, and why the tooling reported what it reported.

I'll try my best to do things that way. However, the thing is that there
are few reports on my own system... That's why I shared Dept with the
LKML community.

> I want to stress that your Dept tooling looks to have the potential of
> being something that will be worthwhile including. But the false positives
> need to be down to the rate of lockdep false positives. As Ted said, if
> it's reporting 95% false positives, nobody is going to look at the 5% of
> real bugs that it finds.

Agreed. Dept should not be merged if that's the case. I'm not pushing
ahead, but I'm convinced that Dept does what a dependency tracker should
do. Let's see how valuable it is, especially in the middle of developing
something in the kernel.

Thanks,
Byungchul
On Thu, Feb 17, 2022 at 11:19:15PM -0500, Theodore Ts'o wrote:
> On Thu, Feb 17, 2022 at 12:00:05PM -0500, Steven Rostedt wrote:
> >
> > I personally believe that there's potential that this can be helpful and we
> > will want to merge it.
> >
> > But, what I believe Ted is trying to say is, if you do not know if the
> > report is a bug or not, please do not ask the maintainers to determine it
> > for you. This is a good opportunity for you to look to see why your tool
> > reported an issue, and learn that subsystem. Look at if this is really a
> > bug or not, and investigate why.
>
> I agree there's potential here, or I would have ignored the ext4 "bug
> report".

I just checked this one. I appreciate it...

> When we can get rid of the false positives, I think it should be

Of course, false positives should be removed once they're found. I will
try my best to remove all of them on my own as much as possible.
However, the thing is that I can't see any reports beyond those that
show up on my own system.

> merged; I'd just rather it not be merged until after the false
> positives are fixed, since otherwise, someone well-meaning will start
> using it with Syzkaller, and noise that maintainers need to deal with
> (with people requesting reverts of two year old commits, etc) will
> increase by a factor of ten or more. (With Syzbot reproducers that

Agreed.

> set up random cgroups, IP tunnels with wireguard enabled, FUSE stress
> testers, etc., that file system maintainers will be asked to try to
> disentangle.)
>
> So from a maintainer's perspective, false positives are highly
> negative. It may be that from some people's POV, one bug found and 20
> false positives might still be "useful". But if your tool gains a
> reputation of not valuing maintainers' time, it's just going to make
> us (or at least me :-) cranky, and it's going to be very hard to

Agreed.

> recover from that perception. So it's probably better to be very
> conservative and careful in polishing it before asking for it to be
> merged.

If it's true that there are as many false positives as 95%, then I'll
fix those first, for sure, before asking for it to be merged. Let's see
if that is so.

To kernel developers: I'd appreciate it if you'd let us know whether you
see more real ones than false positives while developing something in
the kernel, so we know it's useful. Otherwise, it's hard to measure how
many false positives it reports, how valuable it is, and so on...

Thanks,
Byungchul
So I was trying to understand what this report is about for some time but honestly I have failed... On Thu 17-02-22 20:10:04, Byungchul Park wrote: > [ 9.008161] =================================================== > [ 9.008163] DEPT: Circular dependency has been detected. > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W > [ 9.008166] --------------------------------------------------- > [ 9.008167] summary > [ 9.008167] --------------------------------------------------- > [ 9.008168] *** DEADLOCK *** > [ 9.008168] > [ 9.008168] context A > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0) > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008173] > [ 9.008173] context B > [ 9.008174] [S] down_write(mapping.invalidate_lock:0) > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008176] [E] up_write(mapping.invalidate_lock:0) > [ 9.008177] > [ 9.008178] context C > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > [ 9.008180] [W] down_write(mapping.invalidate_lock:0) > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0) > [ 9.008181] > [ 9.008182] [S]: start of the event context > [ 9.008183] [W]: the wait blocked > [ 9.008183] [E]: the event not reachable So what situation is your tool complaining about here? Can you perhaps show it here in more common visualization like: TASK1 TASK2 does foo, grabs Z does X, grabs lock Y blocks on Z blocks on Y or something like that? Because I was not able to decipher this from the report even after trying for some time... Honza > [ 9.008184] --------------------------------------------------- > [ 9.008184] context A's detail > [ 9.008185] --------------------------------------------------- > [ 9.008186] context A > [ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0) > [ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008189] > [ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0): > [ 9.008191] (N/A) > [ 9.008191] > [ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0): > [ 9.008193] prepare_to_wait (kernel/sched/wait.c:275) > [ 9.008197] stacktrace: > [ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) > [ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1)) > [ 9.008201] kjournald2 (fs/jbd2/journal.c:250) > [ 9.008203] kthread (kernel/kthread.c:377) > [ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301) > [ 9.008209] > [ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0): > [ 9.008210] __wake_up_common (kernel/sched/wait.c:108) > [ 9.008212] stacktrace: > [ 9.008213] dept_event (kernel/dependency/dept.c:2337) > [ 9.008215] __wake_up_common (kernel/sched/wait.c:109) > [ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > [ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583) > [ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3)) > [ 9.008223] kthread (kernel/kthread.c:377) > [ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301) > [ 9.008226] --------------------------------------------------- > [ 9.008226] context B's detail > [ 9.008227] --------------------------------------------------- > [ 9.008228] context B > [ 9.008228] [S] down_write(mapping.invalidate_lock:0) > [ 
9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > [ 9.008230] [E] up_write(mapping.invalidate_lock:0) > [ 9.008231] > [ 9.008232] [S] down_write(mapping.invalidate_lock:0): > [ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > [ 9.008237] stacktrace: > [ 9.008237] down_write (kernel/locking/rwsem.c:1514) > [ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > [ 9.008241] generic_perform_write (mm/filemap.c:3784) > [ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677) > [ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 9.008250] vfs_write (fs/read_write.c:590) > [ 9.008251] ksys_write (fs/read_write.c:644) > [ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 9.008258] > [ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0): > [ 9.008259] prepare_to_wait (kernel/sched/wait.c:275) > [ 9.008261] stacktrace: > [ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213) > [ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1)) > [ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184) > [ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3)) > [ 9.008267] start_this_handle (fs/jbd2/transaction.c:427) > [ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526) > [ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105) > [ 9.008273] ext4_truncate (fs/ext4/inode.c:4164) > [ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963) > [ 9.008276] generic_perform_write (mm/filemap.c:3784) > [ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677) > [ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 9.008283] vfs_write (fs/read_write.c:590) > [ 9.008284] ksys_write (fs/read_write.c:644) > [ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 9.008287] > [ 9.008288] [E] up_write(mapping.invalidate_lock:0): > [ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829) > [ 9.008291] --------------------------------------------------- > [ 9.008291] context C's detail > [ 9.008292] --------------------------------------------------- > [ 9.008292] context C > [ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > [ 9.008294] [W] down_write(mapping.invalidate_lock:0) > [ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0) > [ 9.008296] > [ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0): > [ 9.008298] (N/A) > [ 9.008298] > [ 9.008299] [W] down_write(mapping.invalidate_lock:0): > [ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > [ 9.008302] stacktrace: > [ 9.008302] down_write (kernel/locking/rwsem.c:1514) > [ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963) > [ 9.008305] generic_perform_write (mm/filemap.c:3784) > [ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677) > [ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 9.008312] vfs_write (fs/read_write.c:590) > [ 9.008314] ksys_write (fs/read_write.c:644) > [ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 9.008316] 
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 9.008318] > [ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0): > [ 9.008320] __wake_up_common (kernel/sched/wait.c:108) > [ 9.008321] stacktrace: > [ 9.008322] __wake_up_common (kernel/sched/wait.c:109) > [ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > [ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508) > [ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527) > [ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560) > [ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583) > [ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3)) > [ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1)) > [ 9.008334] generic_perform_write (mm/filemap.c:3784) > [ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269) > [ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677) > [ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1)) > [ 9.008341] vfs_write (fs/read_write.c:590) > [ 9.008342] ksys_write (fs/read_write.c:644) > [ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) > [ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) > [ 9.008347] --------------------------------------------------- > [ 9.008348] information that might be helpful > [ 9.008348] --------------------------------------------------- > [ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2 > [ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 > [ 9.008353] Call Trace: > [ 9.008354] <TASK> > [ 9.008355] dump_stack_lvl (lib/dump_stack.c:107) > [ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762) > [ 9.008360] ? print_circle (kernel/dependency/dept.c:1086) > [ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104) > [ 9.008364] bfs (kernel/dependency/dept.c:860) > [ 9.008366] add_dep (kernel/dependency/dept.c:1423) > [ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651) > [ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108) > [ 9.008372] dept_event (kernel/dependency/dept.c:2337) > [ 9.008374] __wake_up_common (kernel/sched/wait.c:109) > [ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1)) > [ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583) > [ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24) > [ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301) > [ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843) > [ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) > [ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194) > [ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239) > [ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3)) > [ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431) > [ 9.008400] ? 
commit_timeout (fs/jbd2/journal.c:173) > [ 9.008402] kthread (kernel/kthread.c:377) > [ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332) > [ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301) > [ 9.008410] </TASK>
On Thu 17-02-22 20:10:03, Byungchul Park wrote:
> [ 7.009608] ===================================================
> [ 7.009613] DEPT: Circular dependency has been detected.
> [ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W
> [ 7.009616] ---------------------------------------------------
> [ 7.009617] summary
> [ 7.009618] ---------------------------------------------------
> [ 7.009618] *** DEADLOCK ***
> [ 7.009618]
> [ 7.009619] context A
> [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0)
> [ 7.009621] [W] down_write(&ei->i_data_sem:0)
> [ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0)
> [ 7.009624]
> [ 7.009625] context B
> [ 7.009625] [S] down_read(&ei->i_data_sem:0)
> [ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0)
> [ 7.009627] [E] up_read(&ei->i_data_sem:0)
> [ 7.009628]

Looking into this I have noticed that Dept here tracks bitlocks (buffer
locks in particular) but it apparently treats locks on all buffers as one
locking class, so it conflates a lock on a superblock buffer with a lock
on an extent tree block buffer. These are vastly different locks with
different locking constraints. So to avoid false positives in filesystems
we will need to add annotations to differentiate locks on different
buffers (based on what the block is used for), similarly to how we e.g.
annotate i_rwsem for different inodes.

Honza
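For reference, the i_rwsem annotation Jan alludes to follows the usual
lockdep pattern of giving each distinct use its own lock class. A rough
sketch of that pattern (the key name here is made up for illustration;
lockdep_set_class() and struct lock_class_key are the real lockdep
primitives):

	#include <linux/lockdep.h>
	#include <linux/fs.h>

	/* Illustrative key, not a real ext4 symbol. */
	static struct lock_class_key illustrative_dir_i_rwsem_key;

	static void example_classify_inode(struct inode *inode)
	{
		/* Directory inodes get their own i_rwsem class so they
		 * are not conflated with regular-file i_rwsem. */
		if (S_ISDIR(inode->i_mode))
			lockdep_set_class(&inode->i_rwsem,
					  &illustrative_dir_i_rwsem_key);
	}

Something analogous would be needed for buffer locks, so that e.g. a
superblock buffer and an extent tree block buffer end up in different
classes; as Jan notes, no such annotation exists yet for the bitlocks
Dept tracks.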
On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote:
> On Thu 17-02-22 20:10:04, Byungchul Park wrote:
> > [ 9.008161] ===================================================
> > [ 9.008163] DEPT: Circular dependency has been detected.
> > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
> > [ 9.008166] ---------------------------------------------------
> > [ 9.008167] summary
> > [ 9.008167] ---------------------------------------------------
> > [ 9.008168] *** DEADLOCK ***
> > [ 9.008168]
> > [ 9.008168] context A
> > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008173]
> > [ 9.008173] context B
> > [ 9.008174] [S] down_write(mapping.invalidate_lock:0)
> > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008176] [E] up_write(mapping.invalidate_lock:0)
> > [ 9.008177]
> > [ 9.008178] context C
> > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008180] [W] down_write(mapping.invalidate_lock:0)
> > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008181]
> > [ 9.008182] [S]: start of the event context
> > [ 9.008183] [W]: the wait blocked
> > [ 9.008183] [E]: the event not reachable
>
> So what situation is your tool complaining about here? Can you perhaps show
> it here in more common visualization like:

Sure.

> TASK1                   TASK2
> does foo, grabs Z
>                         does X, grabs lock Y
>                         blocks on Z
> blocks on Y
>
> or something like that? Because I was not able to decipher this from the
> report even after trying for some time...

KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)

wait A
--- stuck
                        wait B
                        --- stuck
                                                wait C
                                                --- stuck

wake up B               wake up C               wake up A

where:
A is a wait_queue, j_wait_commit
B is a wait_queue, j_wait_transaction_locked
C is a rwsem, mapping.invalidate_lock

The above is the simplest form. And it's worth noting that Dept focuses
on waits and events themselves rather than on grabbing and releasing
things like locks. The following is the more descriptive form of it.

KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)

wait @j_wait_commit
                        ext4_truncate_failed_write()
                        down_write(mapping.invalidate_lock)

                        ext4_truncate()
                           ...
                           wait @j_wait_transaction_locked

                                                ext4_truncate_failed_write()
                                                down_write(mapping.invalidate_lock)

                                                ext4_should_retry_alloc()
                                                   ...
                                                   __jbd2_log_start_commit()
                                                      wake_up(j_wait_commit)
jbd2_journal_commit_transaction()
   wake_up(j_wait_transaction_locked)
                        up_write(mapping.invalidate_lock)

I hope this helps you understand the report.

Yeah... this is what Dept complained about. And as Ted said, the kthread
would be woken up by another wakeup, so it's not a *deadlock deadlock*.
However, these three threads, and any other tasks waiting for any of the
events A, B, C, would be stuck for a while until that wakeup eventually
wakes up the kthread, kjournald2. I guess that's not what ext4 is meant
to do. Honestly, ext4 and the journal system look so complicated that
I'm not fully convinced, though...
Thanks,
Byungchul
On Tue, Feb 22, 2022 at 09:27:23AM +0100, Jan Kara wrote:
> On Thu 17-02-22 20:10:03, Byungchul Park wrote:
> > [ 7.009608] ===================================================
> > [ 7.009613] DEPT: Circular dependency has been detected.
> > [ 7.009614] 5.17.0-rc1-00014-g8a599299c0cb-dirty #30 Tainted: G W
> > [ 7.009616] ---------------------------------------------------
> > [ 7.009617] summary
> > [ 7.009618] ---------------------------------------------------
> > [ 7.009618] *** DEADLOCK ***
> > [ 7.009618]
> > [ 7.009619] context A
> > [ 7.009619] [S] (unknown)(&(bit_wait_table + i)->dmap:0)
> > [ 7.009621] [W] down_write(&ei->i_data_sem:0)
> > [ 7.009623] [E] event(&(bit_wait_table + i)->dmap:0)
> > [ 7.009624]
> > [ 7.009625] context B
> > [ 7.009625] [S] down_read(&ei->i_data_sem:0)
> > [ 7.009626] [W] wait(&(bit_wait_table + i)->dmap:0)
> > [ 7.009627] [E] up_read(&ei->i_data_sem:0)
> > [ 7.009628]
>
> Looking into this I have noticed that Dept here tracks bitlocks (buffer
> locks in particular) but it apparently treats locks on all buffers as one
> locking class, so it conflates a lock on a superblock buffer with a lock
> on an extent tree block buffer. These are vastly different locks with
> different locking constraints. So to avoid false positives in filesystems
> we will need to add annotations to differentiate locks on different
> buffers (based on what the block is used for), similarly to how we e.g.
> annotate i_rwsem for different inodes.

Exactly, yes. All synchronization objects should be classified by what
they are used for. Each object is already classified by the location of
the code initializing it - roughly speaking, we can normally expect
objects initialized at the same place to share the same constraints -
but we actually assign different constraints according to subtle design
decisions, especially in file systems.

It would also help the code have better documentation ;-)

I'm willing to add annotations for that to fs.

> Honza
On Tue, Feb 22, 2022 at 09:27:23AM +0100, Jan Kara wrote:
> Looking into this I have noticed that Dept here tracks bitlocks (buffer
> locks in particular) but it apparently treats locks on all buffers as one
> locking class, so it conflates a lock on a superblock buffer with a lock
> on an extent tree block buffer. These are vastly different locks with
> different locking constraints. So to avoid false positives in filesystems
> we will need to add annotations to differentiate locks on different
> buffers (based on what the block is used for), similarly to how we e.g.
> annotate i_rwsem for different inodes.

Hi Jan Kara,

I just understood why some folks in this space got mad at the Dept
reports. I barely got reports involving the lock you mentioned on my
system - precisely speaking, only one, even though I've rebooted my
system many times. But another report that someone gave me showed a lot
of reports involving that lock.

Your comment and that report are very helpful. I need to assign each
buffer lock its own class first. Thank you very much.

Thanks,
Byungchul
On Wed 23-02-22 09:35:34, Byungchul Park wrote: > On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote: > > On Thu 17-02-22 20:10:04, Byungchul Park wrote: > > > [ 9.008161] =================================================== > > > [ 9.008163] DEPT: Circular dependency has been detected. > > > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W > > > [ 9.008166] --------------------------------------------------- > > > [ 9.008167] summary > > > [ 9.008167] --------------------------------------------------- > > > [ 9.008168] *** DEADLOCK *** > > > [ 9.008168] > > > [ 9.008168] context A > > > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > > > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0) > > > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > > > [ 9.008173] > > > [ 9.008173] context B > > > [ 9.008174] [S] down_write(mapping.invalidate_lock:0) > > > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > > > [ 9.008176] [E] up_write(mapping.invalidate_lock:0) > > > [ 9.008177] > > > [ 9.008178] context C > > > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > > > [ 9.008180] [W] down_write(mapping.invalidate_lock:0) > > > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0) > > > [ 9.008181] > > > [ 9.008182] [S]: start of the event context > > > [ 9.008183] [W]: the wait blocked > > > [ 9.008183] [E]: the event not reachable > > > > So what situation is your tool complaining about here? Can you perhaps show > > it here in more common visualization like: > > Sure. > > > TASK1 TASK2 > > does foo, grabs Z > > does X, grabs lock Y > > blocks on Z > > blocks on Y > > > > or something like that? Because I was not able to decipher this from the > > report even after trying for some time... > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > wait A > --- stuck > wait B > --- stuck > wait C > --- stuck > > wake up B wake up C wake up A > > where: > A is a wait_queue, j_wait_commit > B is a wait_queue, j_wait_transaction_locked > C is a rwsem, mapping.invalidate_lock I see. But a situation like this is not necessarily a guarantee of a deadlock, is it? I mean there can be task D that will eventually call say 'wake up B' and unblock everything and this is how things were designed to work? Multiple sources of wakeups are quite common I'd say... What does Dept do to prevent false reports in cases like this? > The above is the simplest form. And it's worth noting that Dept focuses > on wait and event itself rather than grabing and releasing things like > lock. The following is the more descriptive form of it. > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > wait @j_wait_commit > ext4_truncate_failed_write() > down_write(mapping.invalidate_lock) > > ext4_truncate() > ... > wait @j_wait_transaction_locked > > ext_truncate_failed_write() > down_write(mapping.invalidate_lock) > > ext4_should_retry_alloc() > ... > __jbd2_log_start_commit() > wake_up(j_wait_commit) > jbd2_journal_commit_transaction() > wake_up(j_wait_transaction_locked) > up_write(mapping.invalidate_lock) > > I hope this would help you understand the report. I see, thanks for explanation! So the above scenario is impossible because for anyone to block on @j_wait_transaction_locked the transaction must be committing, which is done only by kjournald2 kthread and so that thread cannot be waiting at @j_wait_commit. 
Essentially blocking on @j_wait_transaction_locked means @j_wait_commit wakeup was already done. I guess this shows there can be non-trivial dependencies between wait queues which are difficult to track in an automated way and without such tracking we are going to see false positives... Honza
On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote: > On Wed 23-02-22 09:35:34, Byungchul Park wrote: > > On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote: > > > On Thu 17-02-22 20:10:04, Byungchul Park wrote: > > > > [ 9.008161] =================================================== > > > > [ 9.008163] DEPT: Circular dependency has been detected. > > > > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W > > > > [ 9.008166] --------------------------------------------------- > > > > [ 9.008167] summary > > > > [ 9.008167] --------------------------------------------------- > > > > [ 9.008168] *** DEADLOCK *** > > > > [ 9.008168] > > > > [ 9.008168] context A > > > > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0) > > > > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0) > > > > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0) > > > > [ 9.008173] > > > > [ 9.008173] context B > > > > [ 9.008174] [S] down_write(mapping.invalidate_lock:0) > > > > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0) > > > > [ 9.008176] [E] up_write(mapping.invalidate_lock:0) > > > > [ 9.008177] > > > > [ 9.008178] context C > > > > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0) > > > > [ 9.008180] [W] down_write(mapping.invalidate_lock:0) > > > > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0) > > > > [ 9.008181] > > > > [ 9.008182] [S]: start of the event context > > > > [ 9.008183] [W]: the wait blocked > > > > [ 9.008183] [E]: the event not reachable > > > > > > So what situation is your tool complaining about here? Can you perhaps show > > > it here in more common visualization like: > > > > Sure. > > > > > TASK1 TASK2 > > > does foo, grabs Z > > > does X, grabs lock Y > > > blocks on Z > > > blocks on Y > > > > > > or something like that? Because I was not able to decipher this from the > > > report even after trying for some time... > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > wait A > > --- stuck > > wait B > > --- stuck > > wait C > > --- stuck > > > > wake up B wake up C wake up A > > > > where: > > A is a wait_queue, j_wait_commit > > B is a wait_queue, j_wait_transaction_locked > > C is a rwsem, mapping.invalidate_lock > > I see. But a situation like this is not necessarily a guarantee of a > deadlock, is it? I mean there can be task D that will eventually call say > 'wake up B' and unblock everything and this is how things were designed to > work? Multiple sources of wakeups are quite common I'd say... What does Yes. At the very beginning when I desgined Dept, I was thinking whether to support multiple wakeup sources or not for a quite long time. Supporting it would be a better option to aovid non-critical reports. However, I thought anyway we'd better fix it - not urgent tho - if there's any single circle dependency. That's why I decided not to support it for now and wanted to gather the kernel guys' opinions. Thing is which policy we should go with. > Dept do to prevent false reports in cases like this? > > > The above is the simplest form. And it's worth noting that Dept focuses > > on wait and event itself rather than grabing and releasing things like > > lock. The following is the more descriptive form of it. > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > wait @j_wait_commit > > ext4_truncate_failed_write() > > down_write(mapping.invalidate_lock) > > > > ext4_truncate() > > ... 
> > wait @j_wait_transaction_locked > > > > ext_truncate_failed_write() > > down_write(mapping.invalidate_lock) > > > > ext4_should_retry_alloc() > > ... > > __jbd2_log_start_commit() > > wake_up(j_wait_commit) > > jbd2_journal_commit_transaction() > > wake_up(j_wait_transaction_locked) > > up_write(mapping.invalidate_lock) > > > > I hope this would help you understand the report. > > I see, thanks for explanation! So the above scenario is impossible because My pleasure. > for anyone to block on @j_wait_transaction_locked the transaction must be > committing, which is done only by kjournald2 kthread and so that thread > cannot be waiting at @j_wait_commit. Essentially blocking on > @j_wait_transaction_locked means @j_wait_commit wakeup was already done. kjournal2 repeatedly does the wait and the wake_up so the above scenario looks possible to me even based on what you explained. Maybe I should understand how the journal things work more for furhter discussion. Your explanation is so helpful. Thank you really. Thanks, Byungchul > I guess this shows there can be non-trivial dependencies between wait > queues which are difficult to track in an automated way and without such > tracking we are going to see false positives... > > Honza > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR
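For readers following the kjournald2 discussion, the loop in question
has roughly this shape (a heavily simplified paraphrase of the kthread
in fs/jbd2/journal.c; the real code also arms a commit timer, handles
checkpointing, and uses an open-coded prepare_to_wait() loop):

	/* Simplified sketch of the kjournald2 main loop; not verbatim. */
	static int kjournald2_sketch(void *arg)
	{
		journal_t *journal = arg;

		while (!kthread_should_stop()) {
			if (journal->j_commit_sequence !=
			    journal->j_commit_request) {
				/* A commit was requested; performing it is
				 * what eventually delivers the
				 * j_wait_transaction_locked wakeup. */
				jbd2_journal_commit_transaction(journal);
				continue;
			}
			/* Nothing to do: sleep on j_wait_commit until a
			 * commit is requested or the commit timer fires. */
			wait_event_interruptible(journal->j_wait_commit,
				journal->j_commit_sequence !=
				journal->j_commit_request);
		}
		return 0;
	}

The same thread alternates between waiting on j_wait_commit and
delivering the j_wait_transaction_locked wakeup, which is exactly the
relationship the two sides are debating.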
On Thu 24-02-22 10:11:02, Byungchul Park wrote: > On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote: > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > > > wait A > > > --- stuck > > > wait B > > > --- stuck > > > wait C > > > --- stuck > > > > > > wake up B wake up C wake up A > > > > > > where: > > > A is a wait_queue, j_wait_commit > > > B is a wait_queue, j_wait_transaction_locked > > > C is a rwsem, mapping.invalidate_lock > > > > I see. But a situation like this is not necessarily a guarantee of a > > deadlock, is it? I mean there can be task D that will eventually call say > > 'wake up B' and unblock everything and this is how things were designed to > > work? Multiple sources of wakeups are quite common I'd say... What does > > Yes. At the very beginning when I desgined Dept, I was thinking whether > to support multiple wakeup sources or not for a quite long time. > Supporting it would be a better option to aovid non-critical reports. > However, I thought anyway we'd better fix it - not urgent tho - if > there's any single circle dependency. That's why I decided not to > support it for now and wanted to gather the kernel guys' opinions. Thing > is which policy we should go with. I see. So supporting only a single wakeup source is fine for locks I guess. But for general wait queues or other synchronization mechanisms, I'm afraid it will lead to quite some false positive reports. Just my 2c. > > Dept do to prevent false reports in cases like this? > > > > > The above is the simplest form. And it's worth noting that Dept focuses > > > on wait and event itself rather than grabing and releasing things like > > > lock. The following is the more descriptive form of it. > > > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > > > wait @j_wait_commit > > > ext4_truncate_failed_write() > > > down_write(mapping.invalidate_lock) > > > > > > ext4_truncate() > > > ... > > > wait @j_wait_transaction_locked > > > > > > ext_truncate_failed_write() > > > down_write(mapping.invalidate_lock) > > > > > > ext4_should_retry_alloc() > > > ... > > > __jbd2_log_start_commit() > > > wake_up(j_wait_commit) > > > jbd2_journal_commit_transaction() > > > wake_up(j_wait_transaction_locked) > > > up_write(mapping.invalidate_lock) > > > > > > I hope this would help you understand the report. > > > > I see, thanks for explanation! So the above scenario is impossible because > > My pleasure. > > > for anyone to block on @j_wait_transaction_locked the transaction must be > > committing, which is done only by kjournald2 kthread and so that thread > > cannot be waiting at @j_wait_commit. Essentially blocking on > > @j_wait_transaction_locked means @j_wait_commit wakeup was already done. > > kjournal2 repeatedly does the wait and the wake_up so the above scenario > looks possible to me even based on what you explained. Maybe I should > understand how the journal things work more for furhter discussion. Your > explanation is so helpful. Thank you really. OK, let me provide you with more details for better understanding :) In jbd2 we have an object called 'transaction'. This object can go through many states but for our case is important that transaction is moved to T_LOCKED state and out of it only while jbd2_journal_commit_transaction() function is executing and waiting on j_wait_transaction_locked waitqueue is exactly waiting for a transaction to get out of T_LOCKED state. Function jbd2_journal_commit_transaction() is executed only by kjournald. 
Hence anyone can see transaction in T_LOCKED state only if kjournald is running inside jbd2_journal_commit_transaction() and thus kjournald cannot be sleeping on j_wait_commit at the same time. Does this explain things? Honza
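Restated as pseudo-code, the invariant Jan describes looks roughly like
this (a sketch of the logic only, not the actual jbd2 source):

	/* Executed only by kjournald2. */
	jbd2_journal_commit_transaction(journal)
	{
		transaction->t_state = T_LOCKED;   /* entered only here...  */
		...
		transaction->t_state = T_FLUSH;    /* ...and left only here */
		wake_up(&journal->j_wait_transaction_locked);
		...
	}

	/* Any other task: blocking here means it observed T_LOCKED, so
	 * kjournald2 is necessarily inside the function above right now,
	 * and therefore is not sleeping on j_wait_commit. */
	wait_transaction_locked(journal);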
On Thu, Feb 24, 2022 at 11:22:39AM +0100, Jan Kara wrote:
> On Thu 24-02-22 10:11:02, Byungchul Park wrote:
> > On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote:
> > > > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> > > >
> > > > wait A
> > > > --- stuck
> > > >                         wait B
> > > >                         --- stuck
> > > >                                                 wait C
> > > >                                                 --- stuck
> > > >
> > > > wake up B               wake up C               wake up A
> > > >
> > > > where:
> > > > A is a wait_queue, j_wait_commit
> > > > B is a wait_queue, j_wait_transaction_locked
> > > > C is a rwsem, mapping.invalidate_lock
> > >
> > > I see. But a situation like this is not necessarily a guarantee of a
> > > deadlock, is it? I mean there can be task D that will eventually call say
> > > 'wake up B' and unblock everything and this is how things were designed to
> > > work? Multiple sources of wakeups are quite common I'd say... What does
> > > Dept do to prevent false reports in cases like this?
> >
> > Yes. At the very beginning, when I designed Dept, I was thinking for
> > quite a long time about whether to support multiple wakeup sources or
> > not. Supporting them would be a better option to avoid non-critical
> > reports. However, I thought we'd better fix any single circular
> > dependency anyway - even if it's not urgent. That's why I decided not
> > to support them for now and wanted to gather the kernel folks'
> > opinions. The thing is which policy we should go with.
>
> I see. So supporting only a single wakeup source is fine for locks I guess.
> But for general wait queues or other synchronization mechanisms, I'm afraid
> it will lead to quite some false positive reports. Just my 2c.

Thank you for your feedback.

I realized we've been using "false positive" differently. There exist
three types of code in terms of dependency and deadlock. It's worth
noting that in Dept, dependencies are built between waits and events.

---

case 1. Code with an actual circular dependency, but no deadlock.

   The circular dependency can be broken by a rescue wakeup source, e.g.
   a timeout. It's not a deadlock, provided it's acceptable that the
   contexts participating in the circular dependency, and others waiting
   for the events in the circle, are stuck until it gets broken.
   Otherwise, say, if that behavior is not intended, then it's
   problematic anyway.

   1-1. What if we judge this code is problematic?
   1-2. What if we judge this code is good?

case 2. Code with an actual circular dependency, and a deadlock.

   There's no other wakeup source than those within the circular
   dependency. Literally a deadlock. It's problematic and critical.

   2-1. What if we judge this code is problematic?
   2-2. What if we judge this code is good?

case 3. Code with no actual circular dependency, and no deadlock.

   Must be good.

   3-1. What if we judge this code is problematic?
   3-2. What if we judge this code is good?

---

I call only 3-1 a "false positive" circular dependency, while you call
both 1-1 and 3-1 "false positive" deadlocks.

I've been wondering whether the kernel folks, esp. Linus, consider code
with any circular dependency problematic or not, even if it won't lead
to a deadlock - say, case 1. Even though I designed Dept based on what I
believe is right, of course, I'm willing to change the design according
to the majority opinion.

However, I would never allow case 1 if I were the owner of the kernel,
for better stability, even though the code works okay for now anyway.

Thanks,
Byungchul
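As a concrete illustration of the "rescue wakeup source" in case 1, a
wait with a timeout fallback might look like the sketch below (assumed
example code, not from the thread; wait_event_timeout() is the real
kernel API and returns 0 if it timed out with the condition still
false):

	static DECLARE_WAIT_QUEUE_HEAD(circle_wq);
	static bool event_happened;

	static void wait_with_rescue(void)
	{
		/* Even if every explicit waker is itself blocked (an actual
		 * circular dependency, case 1), this wait ends after five
		 * seconds: the timer is the rescue wakeup source. */
		if (!wait_event_timeout(circle_wq, event_happened, 5 * HZ))
			pr_warn("rescued by the timeout, not by a waker\n");
	}

Whether a report should still be emitted for such code is precisely the
policy question being debated in this thread.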
On Mon 28-02-22 18:28:26, Byungchul Park wrote: > On Thu, Feb 24, 2022 at 11:22:39AM +0100, Jan Kara wrote: > > On Thu 24-02-22 10:11:02, Byungchul Park wrote: > > > On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote: > > > > > KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write) > > > > > > > > > > wait A > > > > > --- stuck > > > > > wait B > > > > > --- stuck > > > > > wait C > > > > > --- stuck > > > > > > > > > > wake up B wake up C wake up A > > > > > > > > > > where: > > > > > A is a wait_queue, j_wait_commit > > > > > B is a wait_queue, j_wait_transaction_locked > > > > > C is a rwsem, mapping.invalidate_lock > > > > > > > > I see. But a situation like this is not necessarily a guarantee of a > > > > deadlock, is it? I mean there can be task D that will eventually call say > > > > 'wake up B' and unblock everything and this is how things were designed to > > > > work? Multiple sources of wakeups are quite common I'd say... What does > > > > > > Yes. At the very beginning when I desgined Dept, I was thinking whether > > > to support multiple wakeup sources or not for a quite long time. > > > Supporting it would be a better option to aovid non-critical reports. > > > However, I thought anyway we'd better fix it - not urgent tho - if > > > there's any single circle dependency. That's why I decided not to > > > support it for now and wanted to gather the kernel guys' opinions. Thing > > > is which policy we should go with. > > > > I see. So supporting only a single wakeup source is fine for locks I guess. > > But for general wait queues or other synchronization mechanisms, I'm afraid > > it will lead to quite some false positive reports. Just my 2c. > > Thank you for your feedback. > > I realized we've been using "false positive" differently. There exist > the three types of code in terms of dependency and deadlock. It's worth > noting that dependencies are built from between waits and events in Dept. > > --- > > case 1. Code with an actual circular dependency, but not deadlock. > > A circular dependency can be broken by a rescue wakeup source e.g. > timeout. It's not a deadlock. If it's okay that the contexts > participating in the circular dependency and others waiting for the > events in the circle are stuck until it gets broken. Otherwise, say, > if it's not meant, then it's anyway problematic. > > 1-1. What if we judge this code is problematic? > 1-2. What if we judge this code is good? > > case 2. Code with an actual circular dependency, and deadlock. > > There's no other wakeup source than those within the circular > dependency. Literally deadlock. It's problematic and critical. > > 2-1. What if we judge this code is problematic? > 2-2. What if we judge this code is good? > > case 3. Code with no actual circular dependency, and not deadlock. > > Must be good. > > 3-1. What if we judge this code is problematic? > 3-2. What if we judge this code is good? > > --- > > I call only 3-1 "false positive" circular dependency. And you call 1-1 > and 3-1 "false positive" deadlock. > > I've been wondering if the kernel guys esp. Linus considers code with > any circular dependency is problematic or not, even if it won't lead to > a deadlock, say, case 1. Even though I designed Dept based on what I > believe is right, of course, I'm willing to change the design according > to the majority opinion. > > However, I would never allow case 1 if I were the owner of the kernel > for better stability, even though the code works anyway okay for now. 
So yes, I call a report for the situation "there is a circular
dependency but deadlock is not possible" a false positive. And that is
because, in my opinion, your definition of circular dependency includes
schemes that are useful and used in the kernel.

Your example in case 1 is kind of borderline (I personally would
consider that a bug as well) but there are other, more valid schemes
with multiple wakeup sources like:

We have a queue of work to do Q protected by lock L. Consumer process
has code like:

	while (1) {
		lock L
		prepare_to_wait(work_queued);
		if (no work) {
			unlock L
			sleep
		} else {
			unlock L
			do work
			wake_up(work_done)
		}
	}

AFAIU Dept will create a dependency here that 'wakeup work_done' is
after 'wait for work_queued'. Producer has code like:

	while (1) {
		lock L
		prepare_to_wait(work_done)
		if (too much work queued) {
			unlock L
			sleep
		} else {
			queue work
			unlock L
			wake_up(work_queued)
		}
	}

And Dept will create a dependency here that 'wakeup work_queued' is
after 'wait for work_done'. And thus we have a trivial cycle in the
dependencies despite the code being perfectly valid and safe.

Honza
On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > case 1. Code with an actual circular dependency, but not deadlock.
> >
> >    A circular dependency can be broken by a rescue wakeup source, e.g. a timeout. It's not a deadlock if it's okay that the contexts participating in the circular dependency, and others waiting for the events in the circle, are stuck until it gets broken. Otherwise, say, if that stall is not intended, then it's problematic anyway.
> >
> >    1-1. What if we judge this code is problematic?
> >    1-2. What if we judge this code is good?
> >
> > I've been wondering whether the kernel guys, esp. Linus, consider code with any circular dependency problematic or not, even if it won't lead to a deadlock, say, case 1. Even though I designed Dept based on what I believe is right, of course, I'm willing to change the design according to the majority opinion.
> >
> > However, I would never allow case 1 if I were the owner of the kernel, for better stability, even though the code works okay for now.

Note, I used the example of the timeout as the most obvious way of explaining that a deadlock is not possible. There is also the much more complex explanation which Jan was trying to give, which is what leads to the circular dependency. When trying to start a handle, it can happen that either (a) there is not enough space in the journal for new handles, or (b) the current transaction is so large that if we don't close it and start a new one, we will end up running out of space in the future. In either case, start_this_handle() will block starting any more handles, and then wake up the commit thread. The commit thread then waits for the currently running threads to complete, before it allows new handles to start, and then it will complete the commit. In the case of (a) we then need to do a journal checkpoint, which is more work to release space in the journal, and only then can we allow new handles to start.

The bottom line is (a) it works, (b) there aren't significant delays, and for DEPT to complain that this is somehow wrong and we need to completely rearchitect perfectly working code because it doesn't conform to DEPT's idea of what is "correct" is not acceptable.

> We have a queue of work to do Q protected by lock L. Consumer process has code like:
>
>	while (1) {
>		lock L
>		prepare_to_wait(work_queued);
>		if (no work) {
>			unlock L
>			sleep
>		} else {
>			unlock L
>			do work
>			wake_up(work_done)
>		}
>	}
>
> AFAIU Dept will create dependency here that 'wakeup work_done' is after 'wait for work_queued'. Producer has code like:
>
>	while (1) {
>		lock L
>		prepare_to_wait(work_done)
>		if (too much work queued) {
>			unlock L
>			sleep
>		} else {
>			queue work
>			unlock L
>			wake_up(work_queued)
>		}
>	}
>
> And Dept will create dependency here that 'wakeup work_queued' is after 'wait for work_done'. And thus we have a trivial cycle in the dependencies despite the code being perfectly valid and safe.

Cheers,

					- Ted
On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> On Mon 28-02-22 18:28:26, Byungchul Park wrote:
> > case 1. Code with an actual circular dependency, but not deadlock.
> >
> >    A circular dependency can be broken by a rescue wakeup source, e.g. a timeout. It's not a deadlock if it's okay that the contexts participating in the circular dependency, and others waiting for the events in the circle, are stuck until it gets broken. Otherwise, say, if that stall is not intended, then it's problematic anyway.
> >
> >    1-1. What if we judge this code is problematic?
> >    1-2. What if we judge this code is good?
> >
> > case 2. Code with an actual circular dependency, and deadlock.
> >
> >    There's no other wakeup source than those within the circular dependency. Literally a deadlock. It's problematic and critical.
> >
> >    2-1. What if we judge this code is problematic?
> >    2-2. What if we judge this code is good?
> >
> > case 3. Code with no actual circular dependency, and not deadlock.
> >
> >    Must be good.
> >
> >    3-1. What if we judge this code is problematic?
> >    3-2. What if we judge this code is good?
> >
> > ---
> >
> > I call only 3-1 a "false positive" circular dependency. And you call 1-1 and 3-1 "false positive" deadlocks.
> >
> > I've been wondering whether the kernel guys, esp. Linus, consider code with any circular dependency problematic or not, even if it won't lead to a deadlock, say, case 1. Even though I designed Dept based on what I believe is right, of course, I'm willing to change the design according to the majority opinion.
> >
> > However, I would never allow case 1 if I were the owner of the kernel, for better stability, even though the code works okay for now.
>
> So yes, I call a report for the situation "There is circular dependency but deadlock is not possible." a false positive. And that is because in my opinion your definition of circular dependency includes schemes that are useful and used in the kernel.
>
> Your example in case 1 is kind of borderline (I personally would consider that a bug as well) but there are other more valid schemes with multiple wakeup sources like:
>
> We have a queue of work to do Q protected by lock L. Consumer process has code like:
>
>	while (1) {
>		lock L
>		prepare_to_wait(work_queued);
>		if (no work) {
>			unlock L
>			sleep
>		} else {
>			unlock L
>			do work
>			wake_up(work_done)
>		}
>	}
>
> AFAIU Dept will create dependency here that 'wakeup work_done' is after 'wait for work_queued'. Producer has code like:

First of all, thank you for this good example.

>	while (1) {
>		lock L
>		prepare_to_wait(work_done)
>		if (too much work queued) {
>			unlock L
>			sleep
>		} else {
>			queue work
>			unlock L
>			wake_up(work_queued)
>		}
>	}
>
> And Dept will create dependency here that 'wakeup work_queued' is after 'wait for work_done'. And thus we have a trivial cycle in the dependencies despite the code being perfectly valid and safe.

Unfortunately, it's neither perfect nor safe without another wakeup source - a rescue wakeup source:

	consumer			producer

					lock L
					(too much work queued == true)
					unlock L
					--- preempted
	lock L
	unlock L
	do work
	lock L
	unlock L
	do work
	...
	(no work == true)
	sleep
					--- scheduled in
					sleep

This code leads to a deadlock without another wakeup source, say, it is not safe.

But yes, I also think this code should be allowed if it anyway runs alongside another wakeup source. In that case, Dept should instead track the rescue wakeup source, since that is what an actual deadlock would involve.

I will correct the code so that Dept tracks the rescue wakeup source whenever it finds such a case.

Lastly, just for your information, I need to explain how Dept works a little more so that you don't misunderstand it.

Assuming the consumer and producer guarantee not to lead to a deadlock, like the following, Dept won't report it as a problem:

	consumer			producer

					sleep
	wakeup work_done
					queue work
	sleep
					wakeup work_queued
	do work
					sleep
	wakeup work_done
					queue work
	sleep
					wakeup work_queued
	do work
					sleep
	...				...

Dept does not consider all waits preceding an event, but only waits that might lead to a deadlock. In this case, Dept works with each region independently:

	consumer			producer

					sleep <- initiates region 1
	--- region 1 starts
	...				...
	--- region 1 ends
	wakeup work_done
	...				...
					queue work
	...				...
	sleep <- initiates region 2
	--- region 2 starts
	...				...
	--- region 2 ends
					wakeup work_queued
	...				...
	do work
	...				...
					sleep <- initiates region 3
	--- region 3 starts
	...				...
	--- region 3 ends
	wakeup work_done
	...				...
					queue work
	...				...
	sleep <- initiates region 4
	--- region 4 starts
	...				...
	--- region 4 ends
					wakeup work_queued
	...				...
	do work
	...				...

That is, Dept does not build dependencies across different regions. So you don't have to worry about unreasonable false positives that much.

Thoughts?

Thanks,
Byungchul

> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
On Mon, Feb 28, 2022 at 04:25:04PM -0500, Theodore Ts'o wrote:
> On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > > case 1. Code with an actual circular dependency, but not deadlock.
> > >
> > >    A circular dependency can be broken by a rescue wakeup source, e.g. a timeout. It's not a deadlock if it's okay that the contexts participating in the circular dependency, and others waiting for the events in the circle, are stuck until it gets broken. Otherwise, say, if that stall is not intended, then it's problematic anyway.
> > >
> > >    1-1. What if we judge this code is problematic?
> > >    1-2. What if we judge this code is good?
> > >
> > > I've been wondering whether the kernel guys, esp. Linus, consider code with any circular dependency problematic or not, even if it won't lead to a deadlock, say, case 1. Even though I designed Dept based on what I believe is right, of course, I'm willing to change the design according to the majority opinion.
> > >
> > > However, I would never allow case 1 if I were the owner of the kernel, for better stability, even though the code works okay for now.
>
> Note, I used the example of the timeout as the most obvious way of explaining that a deadlock is not possible. There is also the much more complex explanation which Jan was trying to give, which is what leads to the circular dependency. When trying to start a handle, it can happen that either (a) there is not enough space in the journal for new handles, or (b) the current transaction is so large that if we don't close it and start a new one, we will end up running out of space in the future. In either case, start_this_handle() will block starting any more handles, and then wake up the commit thread. The commit thread then waits for the currently running threads to complete, before it allows new handles to start, and then it will complete the commit. In the case of (a) we then need to do a journal checkpoint, which is more work to release space in the journal, and only then can we allow new handles to start.

Thank you for the full explanation of how the journal works.

> The bottom line is (a) it works, (b) there aren't significant delays, and for DEPT to complain that this is somehow wrong and we need to completely rearchitect perfectly working code because it doesn't conform to DEPT's idea of what is "correct" is not acceptable.

Thanks to you and Jan Kara, I realized it's not a real dependency in the consumer and producer scenario - but again, *ONLY IF* there is a rescue wakeup source. Dept should instead track the rescue wakeup source in that case.

I won't ask you to rearchitect the working code. The code looks sane.

Thanks a lot.

Thanks,
Byungchul

> > We have a queue of work to do Q protected by lock L. Consumer process has code like:
> >
> >	while (1) {
> >		lock L
> >		prepare_to_wait(work_queued);
> >		if (no work) {
> >			unlock L
> >			sleep
> >		} else {
> >			unlock L
> >			do work
> >			wake_up(work_done)
> >		}
> >	}
> >
> > AFAIU Dept will create dependency here that 'wakeup work_done' is after 'wait for work_queued'. Producer has code like:
> >
> >	while (1) {
> >		lock L
> >		prepare_to_wait(work_done)
> >		if (too much work queued) {
> >			unlock L
> >			sleep
> >		} else {
> >			queue work
> >			unlock L
> >			wake_up(work_queued)
> >		}
> >	}
> >
> > And Dept will create dependency here that 'wakeup work_queued' is after 'wait for work_done'. And thus we have a trivial cycle in the dependencies despite the code being perfectly valid and safe.
>
> Cheers,
>
> 					- Ted
On Thu, Mar 03, 2022 at 10:00:33AM +0900, Byungchul Park wrote:
>
> Unfortunately, it's neither perfect nor safe without another wakeup source - a rescue wakeup source:
>
>	consumer			producer
>
>					lock L
>					(too much work queued == true)
>					unlock L
>					--- preempted
>	lock L
>	unlock L
>	do work
>	lock L
>	unlock L
>	do work
>	...
>	(no work == true)
>	sleep
>					--- scheduled in
>					sleep

That's not how things work in ext4. It's **way** more complicated than that. We have multiple wait channels: one wakes up the consumer (read: the commit thread), and one wakes up any processes waiting for the commit thread to have made forward progress. We also have two spin-lock-protected sequence numbers, one which indicates the currently committed transaction #, and one indicating the transaction # that needs to be committed.

The commit thread will sleep on j_wait_commit, and when it is woken up, it will check to see if there is work to be done (j_commit_sequence != j_commit_request), and if so, do the work, and then wake up processes waiting on the wait queue j_wait_done_commit. (Again, all of this uses the pattern "prepare to wait", then check to see if we should sleep; if we do need to sleep, unlock j_state_lock, then sleep. So this prevents any races leading to lost wakeups.)

On the start_this_handle() thread, if the current transaction is too full, we set j_commit_request to its transaction id to indicate that we want the current transaction to be committed, and then we wake up the j_wait_commit wait queue. Then we enter a loop where we do a prepare_to_wait on j_wait_done_commit, check to see if j_commit_sequence == the transaction id that we want to be completed, and if it's not done yet, we unlock the j_state_lock spinlock and go to sleep. Again, because of the prepare_to_wait, there is no chance of a lost wakeup.

So there really is no "consumer" and "producer" here. If you really insist on using this model, which really doesn't apply, then for one thread, it's the consumer with respect to one wait queue, and the producer with respect to the *other* wait queue. For the other thread, the consumer and producer roles are reversed.

And of course, this is a highly simplified model, since we also have a wait queue used by the commit thread to wait for the number of active handles on a particular transaction to go to zero, and stop_this_handle() will wake up the commit thread via this wait queue when the last active handle on a particular transaction is retired. (And yes, that parameter is also protected by a different spin lock which is per-transaction.)

So it seems to me that a fundamental flaw in DEPT's model is assuming that the only waiting paradigm that can be used is consumer/producer, and that's simply not true. The fact that you use the term "lock" is also going to lead to a misleading line of reasoning, because properly speaking, they aren't really locks. We are simply using wait channels to wake up processes as necessary, and then they will check other variables to decide whether or not they need to sleep, and we have an invariant that when these variables change indicating forward progress, the associated wait channel will be woken up.

Cheers,

					- Ted

P.S. This model is also highly simplified since there are other reasons why the commit thread can be woken up, some of which might be via a timeout, and some of which are via the j_wait_commit wait channel but not because j_commit_request has been changed - because the file system is being unmounted, or the file system is being frozen in preparation for a snapshot, etc. These are *not* necessary to prevent a deadlock, because under normal circumstances the two wake channels are sufficient by themselves. So please don't think of them as "rescue wakeup sources"; again, that's highly misleading and the wrong way to think of them.

And to make things even more complicated, we have more than 2 wait channels --- we have *six*:

	/**
	 * @j_wait_transaction_locked:
	 *
	 * Wait queue for waiting for a locked transaction to start
	 * committing, or for a barrier lock to be released.
	 */
	wait_queue_head_t	j_wait_transaction_locked;

	/**
	 * @j_wait_done_commit: Wait queue for waiting for commit to complete.
	 */
	wait_queue_head_t	j_wait_done_commit;

	/**
	 * @j_wait_commit: Wait queue to trigger commit.
	 */
	wait_queue_head_t	j_wait_commit;

	/**
	 * @j_wait_updates: Wait queue to wait for updates to complete.
	 */
	wait_queue_head_t	j_wait_updates;

	/**
	 * @j_wait_reserved:
	 *
	 * Wait queue to wait for reserved buffer credits to drop.
	 */
	wait_queue_head_t	j_wait_reserved;

	/**
	 * @j_fc_wait:
	 *
	 * Wait queue to wait for completion of async fast commits.
	 */
	wait_queue_head_t	j_fc_wait;

"There are more things in heaven and Earth, Horatio,
Than are dreamt of in your philosophy."
	- William Shakespeare, Hamlet
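The race-free sleep pattern described above can be sketched in a few lines of kernel-style C. This is a simplified illustration loosely modeled on jbd2's wait-for-commit logic, using the field names Ted mentions (j_state_lock, j_commit_sequence, j_wait_done_commit); the real code in fs/jbd2/ carries more state and is not identical:

	#include <linux/jbd2.h>
	#include <linux/wait.h>
	#include <linux/sched.h>

	/*
	 * Simplified sketch, not the real jbd2 function: wait until
	 * transaction 'tid' has committed. prepare_to_wait() runs before
	 * the condition is re-checked under j_state_lock, so a wake_up()
	 * issued after we drop the lock but before schedule() cannot be
	 * lost.
	 */
	static void wait_for_commit_sketch(journal_t *journal, tid_t tid)
	{
		DEFINE_WAIT(wait);

		read_lock(&journal->j_state_lock);
		while (tid_gt(tid, journal->j_commit_sequence)) {
			prepare_to_wait(&journal->j_wait_done_commit, &wait,
					TASK_UNINTERRUPTIBLE);
			read_unlock(&journal->j_state_lock);
			schedule();	/* commit thread's wake_up() ends this */
			finish_wait(&journal->j_wait_done_commit, &wait);
			read_lock(&journal->j_state_lock);
		}
		read_unlock(&journal->j_state_lock);
	}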
Ted wrote:
> On Thu, Mar 03, 2022 at 10:00:33AM +0900, Byungchul Park wrote:
> >
> > Unfortunately, it's neither perfect nor safe without another wakeup source - a rescue wakeup source:
> >
> >	consumer			producer
> >
> >					lock L
> >					(too much work queued == true)
> >					unlock L
> >					--- preempted
> >	lock L
> >	unlock L
> >	do work
> >	lock L
> >	unlock L
> >	do work
> >	...
> >	(no work == true)
> >	sleep
> >					--- scheduled in
> >					sleep
>
> That's not how things work in ext4. It's **way** more complicated

You seem to have gotten it wrong. This example is what Jan Kara gave me. I just tried to explain things based on Jan Kara's example, so I left all the statements that Jan Kara wrote. Plus, the example was very helpful. Thanks, Jan Kara.

> than that. We have multiple wait channels: one wakes up the consumer (read: the commit thread), and one wakes up any processes waiting for the commit thread to have made forward progress. We also have two spin-lock-protected sequence numbers, one which indicates the currently committed transaction #, and one indicating the transaction # that needs to be committed.
>
> The commit thread will sleep on j_wait_commit, and when it is woken up, it will check to see if there is work to be done (j_commit_sequence != j_commit_request), and if so, do the work, and then wake up processes waiting on the wait queue j_wait_done_commit. (Again, all of this uses the pattern "prepare to wait", then check to see if we should sleep; if we do need to sleep, unlock j_state_lock, then sleep. So this prevents any races leading to lost wakeups.)
>
> On the start_this_handle() thread, if the current transaction is too full, we set j_commit_request to its transaction id to indicate that we want the current transaction to be committed, and then we wake up the j_wait_commit wait queue. Then we enter a loop where we do a prepare_to_wait on j_wait_done_commit, check to see if j_commit_sequence == the transaction id that we want to be completed, and if it's not done yet, we unlock the j_state_lock spinlock and go to sleep. Again, because of the prepare_to_wait, there is no chance of a lost wakeup.

The above explanation gives me a clear view of how the journal's synchronization works. I appreciate it.

> So there really is no "consumer" and "producer" here. If you really insist on using this model, which really doesn't apply, then for one

Dept does not assume a "consumer" and "producer" model at all; Dept works with general waits and events. *That model is just one of them.*

> thread, it's the consumer with respect to one wait queue, and the producer with respect to the *other* wait queue. For the other thread, the consumer and producer roles are reversed.
>
> And of course, this is a highly simplified model, since we also have a wait queue used by the commit thread to wait for the number of active handles on a particular transaction to go to zero, and stop_this_handle() will wake up the commit thread via this wait queue when the last active handle on a particular transaction is retired. (And yes, that parameter is also protected by a different spin lock which is per-transaction.)

This one also gives me a clear view. Thanks a lot.

> So it seems to me that a fundamental flaw in DEPT's model is assuming that the only waiting paradigm that can be used is consumer/producer,

No, Dept does not.

> and that's simply not true. The fact that you use the term "lock" is also going to lead to a misleading line of reasoning, because properly

"lock/unlock L" comes from Jan Kara's example. It has almost nothing to do with the explanation; I just left it in as a statement from Jan Kara's example.

> speaking, they aren't really locks. We are simply using wait channels

I totally agree with you. *They aren't really locks; it's just waits and wakeups.* That's exactly why I decided to develop Dept. Unlike Lockdep, Dept is not interested in locks, but focuses on waits and wakeup sources themselves. I think you've gotten Dept quite wrong. Please ask me more if there are things you doubt about Dept.

Thanks,
Byungchul
On Thu 03-03-22 10:00:33, Byungchul Park wrote:
> On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > On Mon 28-02-22 18:28:26, Byungchul Park wrote:
> > > case 1. Code with an actual circular dependency, but not deadlock.
> > >
> > >    A circular dependency can be broken by a rescue wakeup source, e.g. a timeout. It's not a deadlock if it's okay that the contexts participating in the circular dependency, and others waiting for the events in the circle, are stuck until it gets broken. Otherwise, say, if that stall is not intended, then it's problematic anyway.
> > >
> > >    1-1. What if we judge this code is problematic?
> > >    1-2. What if we judge this code is good?
> > >
> > > case 2. Code with an actual circular dependency, and deadlock.
> > >
> > >    There's no other wakeup source than those within the circular dependency. Literally a deadlock. It's problematic and critical.
> > >
> > >    2-1. What if we judge this code is problematic?
> > >    2-2. What if we judge this code is good?
> > >
> > > case 3. Code with no actual circular dependency, and not deadlock.
> > >
> > >    Must be good.
> > >
> > >    3-1. What if we judge this code is problematic?
> > >    3-2. What if we judge this code is good?
> > >
> > > ---
> > >
> > > I call only 3-1 a "false positive" circular dependency. And you call 1-1 and 3-1 "false positive" deadlocks.
> > >
> > > I've been wondering whether the kernel guys, esp. Linus, consider code with any circular dependency problematic or not, even if it won't lead to a deadlock, say, case 1. Even though I designed Dept based on what I believe is right, of course, I'm willing to change the design according to the majority opinion.
> > >
> > > However, I would never allow case 1 if I were the owner of the kernel, for better stability, even though the code works okay for now.
> >
> > So yes, I call a report for the situation "There is circular dependency but deadlock is not possible." a false positive. And that is because in my opinion your definition of circular dependency includes schemes that are useful and used in the kernel.
> >
> > Your example in case 1 is kind of borderline (I personally would consider that a bug as well) but there are other more valid schemes with multiple wakeup sources like:
> >
> > We have a queue of work to do Q protected by lock L. Consumer process has code like:
> >
> >	while (1) {
> >		lock L
> >		prepare_to_wait(work_queued);
> >		if (no work) {
> >			unlock L
> >			sleep
> >		} else {
> >			unlock L
> >			do work
> >			wake_up(work_done)
> >		}
> >	}
> >
> > AFAIU Dept will create dependency here that 'wakeup work_done' is after 'wait for work_queued'. Producer has code like:
>
> First of all, thank you for this good example.
>
> >	while (1) {
> >		lock L
> >		prepare_to_wait(work_done)
> >		if (too much work queued) {
> >			unlock L
> >			sleep
> >		} else {
> >			queue work
> >			unlock L
> >			wake_up(work_queued)
> >		}
> >	}
> >
> > And Dept will create dependency here that 'wakeup work_queued' is after 'wait for work_done'. And thus we have a trivial cycle in the dependencies despite the code being perfectly valid and safe.
>
> Unfortunately, it's neither perfect nor safe without another wakeup source - a rescue wakeup source:
>
>	consumer			producer
>
>					lock L
>					(too much work queued == true)
>					unlock L
>					--- preempted
>	lock L
>	unlock L
>	do work
>	lock L
>	unlock L
>	do work
>	...
>	(no work == true)
>	sleep
>					--- scheduled in
>					sleep
>
> This code leads to a deadlock without another wakeup source, say, it is not safe.

So the scenario you describe above is indeed possible. But the trick is that the wakeup from 'consumer', as it is doing work, will remove 'producer' from the wait queue and change the 'producer' process state to 'TASK_RUNNING'. So when 'producer' calls sleep (in fact schedule()), the scheduler will just treat this as another preemption point and the 'producer' will immediately or soon continue to run. So indeed we can think of this as "another wakeup source", but the source is in the CPU scheduler itself. This is the standard way waitqueues are used in the kernel...

> Lastly, just for your information, I need to explain how Dept works a little more so that you don't misunderstand it.
>
> Assuming the consumer and producer guarantee not to lead to a deadlock, like the following, Dept won't report it as a problem:
>
>	consumer			producer
>
>					sleep
>	wakeup work_done
>					queue work
>	sleep
>					wakeup work_queued
>	do work
>					sleep
>	wakeup work_done
>					queue work
>	sleep
>					wakeup work_queued
>	do work
>					sleep
>	...				...
>
> Dept does not consider all waits preceding an event, but only waits that might lead to a deadlock. In this case, Dept works with each region independently:
>
>	consumer			producer
>
>					sleep <- initiates region 1
>	--- region 1 starts
>	...				...
>	--- region 1 ends
>	wakeup work_done
>	...				...
>					queue work
>	...				...
>	sleep <- initiates region 2
>	--- region 2 starts
>	...				...
>	--- region 2 ends
>					wakeup work_queued
>	...				...
>	do work
>	...				...
>					sleep <- initiates region 3
>	--- region 3 starts
>	...				...
>	--- region 3 ends
>	wakeup work_done
>	...				...
>					queue work
>	...				...
>	sleep <- initiates region 4
>	--- region 4 starts
>	...				...
>	--- region 4 ends
>					wakeup work_queued
>	...				...
>	do work
>	...				...
>
> That is, Dept does not build dependencies across different regions. So you don't have to worry about unreasonable false positives that much.
>
> Thoughts?

Thanks for the explanation! And what exactly defines the 'regions'? When some process goes to sleep on some waitqueue, does this define the start of a region at the place where all the other processes are at that moment, and is a wakeup of the waitqueue the end of the region?

								Honza
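Jan's point - that prepare_to_wait() makes the scheduler itself close the race - can be shown in a short annotated sketch. The names (L, work_done, queue_is_full) are the illustrative ones from the example, not real kernel fields:

	#include <linux/wait.h>
	#include <linux/sched.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(L);			/* illustrative */
	static DECLARE_WAIT_QUEUE_HEAD(work_done);	/* illustrative */
	static bool queue_is_full(void);		/* stand-in condition */

	static void producer_sleep(void)
	{
		DEFINE_WAIT(wait);
		bool too_much;

		spin_lock(&L);
		prepare_to_wait(&work_done, &wait, TASK_INTERRUPTIBLE);
		/* now on the wait queue, state is TASK_INTERRUPTIBLE */
		too_much = queue_is_full();
		spin_unlock(&L);

		/*
		 * If the consumer runs right here, drains the queue and
		 * calls wake_up(&work_done), that removes us from the
		 * queue and sets our state back to TASK_RUNNING.
		 */
		if (too_much)
			schedule();	/* TASK_RUNNING: returns at once */

		finish_wait(&work_done, &wait);
	}

In other words, the "preempted before sleep" interleaving in the quoted scenario cannot sleep forever: the wakeup is latched in the task state, which is Jan's "wakeup source in the CPU scheduler itself".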
On Thu, Mar 03, 2022 at 02:23:33PM +0900, Byungchul Park wrote:
> I totally agree with you. *They aren't really locks; it's just waits and wakeups.* That's exactly why I decided to develop Dept. Unlike Lockdep, Dept is not interested in locks, but focuses on waits and wakeup sources themselves. I think you've gotten Dept quite wrong. Please ask me more if there are things you doubt about Dept.

So the question is this --- do you now understand why, even though there is a circular dependency, nothing gets stalled in the interactions between the two wait channels?

					- Ted
On Thu, Mar 03, 2022 at 09:36:25AM -0500, Theodore Ts'o wrote:
> On Thu, Mar 03, 2022 at 02:23:33PM +0900, Byungchul Park wrote:
> > I totally agree with you. *They aren't really locks; it's just waits and wakeups.* That's exactly why I decided to develop Dept. Unlike Lockdep, Dept is not interested in locks, but focuses on waits and wakeup sources themselves. I think you've gotten Dept quite wrong. Please ask me more if there are things you doubt about Dept.
>
> So the question is this --- do you now understand why, even though there is a circular dependency, nothing gets stalled in the interactions between the two wait channels?

???

I'm afraid I don't get you.

All contexts waiting for any of the events in the circular dependency chain will definitely be stuck if there is a circular dependency, as I explained. So we need another wakeup source to break the circle. In the ext4 code, you might have such a wakeup source for breaking the circle.

What I agreed with is:

   The case where 1) the circular dependency is unavoidable, 2) there is another wakeup source for breaking the circle, and 3) the duration of the sleep is short enough, should be acceptable.

Sounds good?

Thanks,
Byungchul
On Thu, Mar 03, 2022 at 10:54:56AM +0100, Jan Kara wrote:
> On Thu 03-03-22 10:00:33, Byungchul Park wrote:
> > Unfortunately, it's neither perfect nor safe without another wakeup source - a rescue wakeup source:
> >
> >	consumer			producer
> >
> >					lock L
> >					(too much work queued == true)
> >					unlock L
> >					--- preempted
> >	lock L
> >	unlock L
> >	do work
> >	lock L
> >	unlock L
> >	do work
> >	...
> >	(no work == true)
> >	sleep
> >					--- scheduled in
> >					sleep
> >
> > This code leads to a deadlock without another wakeup source, say, it is not safe.
>
> So the scenario you describe above is indeed possible. But the trick is that the wakeup from 'consumer', as it is doing work, will remove 'producer' from the wait queue and change the 'producer' process state to 'TASK_RUNNING'. So when 'producer' calls sleep (in fact schedule()), the scheduler will just treat this as another preemption point and the 'producer' will immediately or soon continue to run. So indeed we can think of this as "another wakeup source", but the source is in the CPU scheduler itself. This is the standard way waitqueues are used in the kernel...

Nice! Thanks for the explanation. I will take it into account if needed.

> > Lastly, just for your information, I need to explain how Dept works a little more so that you don't misunderstand it.
> >
> > Assuming the consumer and producer guarantee not to lead to a deadlock, like the following, Dept won't report it as a problem:
> >
> >	consumer			producer
> >
> >					sleep
> >	wakeup work_done
> >					queue work
> >	sleep
> >					wakeup work_queued
> >	do work
> >					sleep
> >	wakeup work_done
> >					queue work
> >	sleep
> >					wakeup work_queued
> >	do work
> >					sleep
> >	...				...
> >
> > Dept does not consider all waits preceding an event, but only waits that might lead to a deadlock. In this case, Dept works with each region independently.
> >
> > That is, Dept does not build dependencies across different regions. So you don't have to worry about unreasonable false positives that much.
> >
> > Thoughts?
>
> Thanks for the explanation! And what exactly defines the 'regions'? When some process goes to sleep on some waitqueue, does this define the start of a region at the place where all the other processes are at that moment, and is a wakeup of the waitqueue the end of the region?

Yes. Let me explain it more for better understanding. (I copied this from the discussion I had with Matthew.)

   ideal view
   -----------
	context X			context Y

	request event E
	...
	write REQUESTEVENT
					when (notice REQUESTEVENT written)
	...				notice the request from X [S]
					--- ideally region 1 starts here
	wait for the event		...
	sleep				if (can see REQUESTEVENT written)
	...					it's on the way to the event
	...				...
					--- ideally region 1 ends here
					finally the event [E]

Dept basically works with the above view with regard to waits and events. But it's very hard to identify the ideal [S] point in practice. So Dept instead identifies the [S] point by checking WAITSTART with memory barriers like the following, which makes Dept work conservatively:

   Dept's view
   ------------
	context X			context Y

	request event E
	...
	write REQUESTEVENT
					when (notice REQUESTEVENT written)
	...				notice the request from X
					--- region 2 (Dept gives up) starts
	wait for the event		...
	write barrier
	write WAITSTART
	read barrier
	sleep				when (notice WAITSTART written)
					ensure the request has come [S]
					--- region 2 (Dept gives up) ends
					--- region 3 starts here
	...				if (can see WAITSTART written)
	...					it's on the way to the event
	...				...
					--- region 3 ends here
					finally the event [E]

In short, Dept works with region 3.

Thanks,
Byungchul
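The WAITSTART handshake sketched above is essentially a store/barrier/load flag protocol. Below is a self-contained illustration in kernel-style C; the struct and field names are invented for this example and are not Dept's actual data structures:

	#include <linux/compiler.h>
	#include <asm/barrier.h>

	/* Invented for illustration - not Dept's real types. */
	struct dept_handshake {
		int requested;		/* "REQUESTEVENT" in the diagram */
		int waitstart;		/* "WAITSTART" in the diagram */
	};

	/* Context X: request the event, then publish that the wait began. */
	static void waiter_side(struct dept_handshake *h)
	{
		WRITE_ONCE(h->requested, 1);	/* request event E */
		smp_wmb();		/* order: requested before waitstart */
		WRITE_ONCE(h->waitstart, 1);
		/* ... sleep ... */
	}

	/* Context Y: if WAITSTART is visible, the request must be too: [S]. */
	static bool event_side(struct dept_handshake *h)
	{
		if (!READ_ONCE(h->waitstart))
			return false;	/* region 2: Dept gives up */
		smp_rmb();		/* pairs with smp_wmb() above */
		/* h->requested is guaranteed visible; region 3 can start. */
		return READ_ONCE(h->requested) == 1;
	}

The barrier pairing is what lets context Y treat "WAITSTART seen" as a conservative but safe stand-in for the ideal [S] point.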
On Thu, Mar 03, 2022 at 09:36:25AM -0500, Theodore Ts'o wrote:
> On Thu, Mar 03, 2022 at 02:23:33PM +0900, Byungchul Park wrote:
> > I totally agree with you. *They aren't really locks; it's just waits and wakeups.* That's exactly why I decided to develop Dept. Unlike Lockdep, Dept is not interested in locks, but focuses on waits and wakeup sources themselves. I think you've gotten Dept quite wrong. Please ask me more if there are things you doubt about Dept.
>
> So the question is this --- do you now understand why, even though there is a circular dependency, nothing gets stalled in the interactions between the two wait channels?

I found a point where the two wait channels don't lead to a deadlock in some cases, thanks to Jan Kara. I will fix it so that Dept won't complain about it.

Thanks,
Byungchul

>
> - Ted
On Fri, Mar 04, 2022 at 09:42:37AM +0900, Byungchul Park wrote:
>
> All contexts waiting for any of the events in the circular dependency chain will definitely be stuck if there is a circular dependency, as I explained. So we need another wakeup source to break the circle. In the ext4 code, you might have such a wakeup source for breaking the circle.
>
> What I agreed with is:
>
>    The case where 1) the circular dependency is unavoidable, 2) there is another wakeup source for breaking the circle, and 3) the duration of the sleep is short enough, should be acceptable.
>
> Sounds good?

These dependencies are part of every single ext4 metadata update, and if there were any unnecessary sleeps, this would be a major performance gap; this is a very well studied part of ext4.

There are some places where we sleep, sure. In some cases start_this_handle() needs to wait for a commit to complete, and the commit thread might need to sleep for I/O to complete. But the moment the thing that we're waiting for is complete, we wake up all of the processes on the wait queue. In the case where we wait for I/O completion, though, that wakeup is coming from the device driver, when it receives the I/O completion interrupt from the hard drive. Is that considered an "external source"? Maybe DEPT doesn't recognize that this is certain to happen just as day follows the night? (Well, maybe the I/O completion interrupt might not happen if the disk drive bursts into flames --- but then, you've got bigger problems. :-)

In any case, if DEPT is going to report these circular dependencies as "bugs that MUST be fixed", it's going to be pure noise, and I will ignore all DEPT reports and will push back on having Lockdep replaced by DEPT --- because Lockdep gives us actionable reports, and if DEPT can't tell the difference between a valid programming pattern and a bug, then it's worse than useless.

Sounds good?

					- Ted
On Fri, Mar 04, 2022 at 12:20:02PM +0900, Byungchul Park wrote:
>
> I found a point where the two wait channels don't lead to a deadlock in some cases, thanks to Jan Kara. I will fix it so that Dept won't complain about it.

I sent my last (admittedly cranky) message before you sent this. I'm glad you finally understood Jan's explanation. I was trying to tell you the same thing, but apparently I failed to communicate in a sufficiently clear manner.

In any case, what Jan described is a fundamental part of how wait queues work, and I'm kind of amazed that you were able to implement DEPT without understanding it. (But maybe that is why some of the DEPT reports were completely incomprehensible to me; I couldn't interpret why in the world DEPT was saying there was a problem.)

In any case, the thing I would ask is a little humility. We regularly use lockdep, and we run a huge number of stress tests, throughout each development cycle. So if DEPT is issuing lots of reports about apparently circular dependencies, please try to be open to the thought that the fault is in DEPT, and don't try to argue with maintainers that their code MUST be buggy --- that since you don't understand our code, and DEPT must be theoretically perfect, it is up to the Maintainers to prove to you that their code is correct.

I am going to gently suggest that it is at least as likely, if not more likely, that the failure is in DEPT or in your understanding of how kernel wait channels and locking work. After all, why would it be that we haven't found these problems via our other QA practices?

Cheers,

					- Ted
On Fri, Mar 04, 2022 at 10:26:23PM -0500, Theodore Ts'o wrote:
> On Fri, Mar 04, 2022 at 09:42:37AM +0900, Byungchul Park wrote:
> >
> > All contexts waiting for any of the events in the circular dependency chain will definitely be stuck if there is a circular dependency, as I explained. So we need another wakeup source to break the circle. In the ext4 code, you might have such a wakeup source for breaking the circle.
> >
> > What I agreed with is:
> >
> >    The case where 1) the circular dependency is unavoidable, 2) there is another wakeup source for breaking the circle, and 3) the duration of the sleep is short enough, should be acceptable.
> >
> > Sounds good?
>
> These dependencies are part of every single ext4 metadata update, and if there were any unnecessary sleeps, this would be a major performance gap; this is a very well studied part of ext4.
>
> There are some places where we sleep, sure. In some cases start_this_handle() needs to wait for a commit to complete, and the commit thread might need to sleep for I/O to complete. But the moment the thing that we're waiting for is complete, we wake up all of the processes on the wait queue. In the case where we wait for I/O completion, though, that wakeup is coming from the device driver, when it receives the I/O completion interrupt from the hard drive. Is that considered an "external source"? Maybe DEPT doesn't recognize that this is certain to happen just as day follows the night? (Well, maybe the I/O completion interrupt might not happen if the disk drive bursts into flames --- but then, you've got bigger problems. :-)

Almost everything you've been blaming Dept for is total nonsense. Based on what you're saying, I'm convinced that you don't understand even 1% of how Dept works. You don't even try to understand it before blaming it.

You don't have to understand and support it. But I can't respond to you if you keep saying silly things that way.

> In any case, if DEPT is going to report these circular dependencies as "bugs that MUST be fixed", it's going to be pure noise, and I will ignore all DEPT reports and will push back on having Lockdep replaced

Dept is going to be improved so that what you are concerned about won't be reported.

> by DEPT --- because Lockdep gives us actionable reports, and if DEPT

Right. Dept should give actionable reports, too.

> can't tell the difference between a valid programming pattern and a bug, then it's worse than useless.

Needless to say.
On Fri, Mar 04, 2022 at 10:40:35PM -0500, Theodore Ts'o wrote:
> On Fri, Mar 04, 2022 at 12:20:02PM +0900, Byungchul Park wrote:
> >
> > I found a point where the two wait channels don't lead to a deadlock in some cases, thanks to Jan Kara. I will fix it so that Dept won't complain about it.
>
> I sent my last (admittedly cranky) message before you sent this. I'm glad you finally understood Jan's explanation. I was trying to tell

Not finally. I've understood him whenever he tried to tell me something.

> you the same thing, but apparently I failed to communicate in a

I don't think so. Your point and Jan's point are different. All he has said makes sense. But what you say does not.

> sufficiently clear manner. In any case, what Jan described is a fundamental part of how wait queues work, and I'm kind of amazed that you were able to implement DEPT without understanding it. (But maybe

Of course, it was possible because all that Dept has to know for its basic operation is waits and events. The subtle things like what Jan told me help Dept be better.

> that is why some of the DEPT reports were completely incomprehensible

It's because you are blindly blaming it without understanding how Dept works at all. I will fix those that must be fixed. Don't worry.

> to me; I couldn't interpret why in the world DEPT was saying there was a problem.)

I can tell you if you really want to understand why. But I can't if you are like this.

> In any case, the thing I would ask is a little humility. We regularly use lockdep, and we run a huge number of stress tests, throughout each development cycle.

Sure.

> So if DEPT is issuing lots of reports about apparently circular dependencies, please try to be open to the thought that the fault is

No one was claiming that Dept doesn't have faults. I think you worry too much.

> in DEPT, and don't try to argue with maintainers that their code MUST be buggy --- that since you don't understand our code, and DEPT must be

No one argued that their code must be buggy, either. So I don't think you have to worry about what's never happened.

> theoretically perfect, it is up to the Maintainers to prove to you that their code is correct.
>
> I am going to gently suggest that it is at least as likely, if not more likely, that the failure is in DEPT or in your understanding of

No doubt. I already think so. But it doesn't mean that I have to keep quiet without discussing how to improve Dept. I will keep improving Dept in a reasonable way.

> how kernel wait channels and locking work. After all, why would it be that we haven't found these problems via our other QA practices?

Let's talk more once you understand at least 10% of how Dept works. Otherwise, I don't think we can talk in a productive way.
On Sat, Mar 05, 2022 at 11:15:38PM +0900, Byungchul Park wrote:
> On Fri, Mar 04, 2022 at 10:26:23PM -0500, Theodore Ts'o wrote:
> > On Fri, Mar 04, 2022 at 09:42:37AM +0900, Byungchul Park wrote:
> > >
> > > All contexts waiting for any of the events in the circular dependency chain will definitely be stuck if there is a circular dependency, as I explained. So we need another wakeup source to break the circle. In the ext4 code, you might have such a wakeup source for breaking the circle.
> > >
> > > What I agreed with is:
> > >
> > >    The case where 1) the circular dependency is unavoidable, 2) there is another wakeup source for breaking the circle, and 3) the duration of the sleep is short enough, should be acceptable.
> > >
> > > Sounds good?
> >
> > These dependencies are part of every single ext4 metadata update, and if there were any unnecessary sleeps, this would be a major performance gap; this is a very well studied part of ext4.
> >
> > There are some places where we sleep, sure. In some cases start_this_handle() needs to wait for a commit to complete, and the commit thread might need to sleep for I/O to complete. But the moment the thing that we're waiting for is complete, we wake up all of the processes on the wait queue. In the case where we wait for I/O completion, though, that wakeup is coming from the device driver, when it receives the I/O completion interrupt from the hard drive. Is that considered an "external source"? Maybe DEPT doesn't recognize that this is certain to happen just as day follows the night? (Well, maybe the I/O completion interrupt might not happen if the disk drive bursts into flames --- but then, you've got bigger problems. :-)
>
> Almost everything you've been blaming Dept for is total nonsense. Based on what you're saying, I'm convinced that you don't understand even 1% of how Dept works. You don't even try to understand it before blaming it.
>
> You don't have to understand and support it. But I can't respond to you if you keep saying silly things that way.

Byungchul, other than ext4, have there been any DEPT reports that other subsystem maintainers agree were valid use cases?

Regarding false positives, just to note that lockdep is not without its share of false positives either. It's just that (as you know) the signal-to-noise ratio should be high for it to be useful. I've put up with lockdep's false positives just because it occasionally saves me from catastrophe.

> > In any case, if DEPT is going to report these circular dependencies as "bugs that MUST be fixed", it's going to be pure noise, and I will ignore all DEPT reports and will push back on having Lockdep replaced
>
> Dept is going to be improved so that what you are concerned about won't be reported.

Yeah, I am looking forward to learning more about it; however, I was wondering about the following: lockdep can already be used for modeling "resource acquire/release" and "resource wait" semantics that are unrelated to locks, as we do in mm reclaim. I am wondering why we cannot just use those existing lockdep mechanisms for the wait/wake use cases (assuming we can agree that circular dependencies related to wait/wake are a bad thing). Or perhaps there's a reason why Peter Zijlstra did not use lockdep for wait/wake dependencies (such as multiple wake sources), considering he wrote a lot of that code.

Keep kicking ass brother, you're doing great.

Thanks,
Joel
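The existing facility Joel refers to is real: a lockdep_map can model acquire/release semantics for things that are not locks, which is how mm's fs_reclaim annotation and the workqueue flush annotations work. Here is a minimal sketch of that mechanism, with an invented map name ("my_resource") standing in for whatever resource is being modeled:

	#include <linux/lockdep.h>

	/* An invented lockdep map modeling a non-lock resource, in the
	 * style of mm's fs_reclaim map. */
	static struct lockdep_map my_resource_map =
		STATIC_LOCKDEP_MAP_INIT("my_resource", &my_resource_map);

	/* Annotate "this path can wait for the resource": lockdep records
	 * an edge from every lock currently held to my_resource_map. */
	static void might_wait_for_my_resource(void)
	{
		lock_map_acquire(&my_resource_map);
		lock_map_release(&my_resource_map);
	}

	/* Annotate "this path produces the resource" by holding the map
	 * across the producing section; any lock taken inside now forms
	 * an edge from my_resource_map to that lock, so a cycle between
	 * the two sides becomes reportable. */
	static void produce_my_resource(void)
	{
		lock_map_acquire(&my_resource_map);
		/* ... do the work that waiters depend on ... */
		lock_map_release(&my_resource_map);
	}

This is the same scheme workqueues use to let lockdep catch flush-vs-work deadlocks today; the open question Joel raises is whether the multi-wakeup-source cases DEPT targets can be expressed this way.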
Hi,

Sorry to butt in as an outsider, but this seems like a shockingly disrespectful discussion for such a wide CC list. I don't want to make rules for how you discuss things (I very rarely contribute), and I see the value in a frank discussion, but maybe you could continue with a reduced CC list? I find it unlikely that I am the only one who could do without this.

Best regards,
Reimar Döffinger

> On 5 Mar 2022, at 15:55, Byungchul Park <byungchul.park@lge.com> wrote:
>
> On Fri, Mar 04, 2022 at 10:40:35PM -0500, Theodore Ts'o wrote:
>> On Fri, Mar 04, 2022 at 12:20:02PM +0900, Byungchul Park wrote:
>>>
>>> I found a point where the two wait channels don't lead to a deadlock in some cases, thanks to Jan Kara. I will fix it so that Dept won't complain about it.
>>
>> I sent my last (admittedly cranky) message before you sent this. I'm glad you finally understood Jan's explanation. I was trying to tell
>
> Not finally. I've understood him whenever he tried to tell me something.
>
>> you the same thing, but apparently I failed to communicate in a
>
> I don't think so. Your point and Jan's point are different. All he has said makes sense. But what you say does not.
>
>> sufficiently clear manner. In any case, what Jan described is a fundamental part of how wait queues work, and I'm kind of amazed that you were able to implement DEPT without understanding it. (But maybe
>
> Of course, it was possible because all that Dept has to know for its basic operation is waits and events. The subtle things like what Jan told me help Dept be better.
>
>> that is why some of the DEPT reports were completely incomprehensible
>
> It's because you are blindly blaming it without understanding how Dept works at all. I will fix those that must be fixed. Don't worry.
>
>> to me; I couldn't interpret why in the world DEPT was saying there was a problem.)
>
> I can tell you if you really want to understand why. But I can't if you are like this.
>
>> In any case, the thing I would ask is a little humility. We regularly use lockdep, and we run a huge number of stress tests, throughout each development cycle.
>
> Sure.
>
>> So if DEPT is issuing lots of reports about apparently circular dependencies, please try to be open to the thought that the fault is
>
> No one was claiming that Dept doesn't have faults. I think you worry too much.
>
>> in DEPT, and don't try to argue with maintainers that their code MUST be buggy --- that since you don't understand our code, and DEPT must be
>
> No one argued that their code must be buggy, either. So I don't think you have to worry about what's never happened.
>
>> theoretically perfect, it is up to the Maintainers to prove to you that their code is correct.
>>
>> I am going to gently suggest that it is at least as likely, if not more likely, that the failure is in DEPT or in your understanding of
>
> No doubt. I already think so. But it doesn't mean that I have to keep quiet without discussing how to improve Dept. I will keep improving Dept in a reasonable way.
>
>> how kernel wait channels and locking work. After all, why would it be that we haven't found these problems via our other QA practices?
>
> Let's talk more once you understand at least 10% of how Dept works. Otherwise, I don't think we can talk in a productive way.
On Sat, Mar 05, 2022 at 11:55:34PM +0900, Byungchul Park wrote:
> > that is why some of the DEPT reports were completely incomprehensible
>
> It's because you are blindly blaming it without understanding how Dept works at all. I will fix those that must be fixed. Don't worry.

Users of DEPT must not have to understand how DEPT works in order to understand and use DEPT reports. If you think I don't understand how DEPT works, I'm going to gently suggest that this means the DEPT reports aren't clear enough, and/or the DEPT documentation needs to be *substantially* improved, or both --- and these need to happen before DEPT is ready to be merged.

> > So if DEPT is issuing lots of reports about apparently circular
> > dependencies, please try to be open to the thought that the fault is
>
> No one was claiming that Dept doesn't have faults. I think you worry too much.

In that case, may I ask that you add back an RFC to the subject prefix (e.g., [PATCH RFC -v5])? Or maybe even add the subject prefix NOT YET READY? I have seen cases where, after a patch series gets to PATCH -v22, people assume that it *must* be ready, as opposed to what it really means, which is "the author is just persistently reposting and rebasing the patch series over and over again". It would be helpful if you directly acknowledged, in each patch submission, that it is not yet ready for prime time.

After all, right now, DEPT has generated two reports in ext4, both of which were false positives, and both of which have required a lot of maintainer time to prove to you that they were in fact false positives. So are we all agreed that DEPT is not ready for prime time?

> No one argued that their code must be buggy, either. So I don't think
> you have to worry about what's never happened.

Well, you kept on insisting that ext4 must have a circular dependency, and that depending on a "rescue wakeup" is bad programming practice, but you'll reluctantly agree to make DEPT accept "rescue wakeups" if that is the will of the developers. My concern here is that the fundamental concept of "rescue wakeups" is wrong; I don't see how you can distinguish between a valid wakeup and one that you and DEPT are going to somehow characterize as dodgy.

Consider: a process can first subscribe to multiple wait queues, and arrange to be woken up by a timeout, and then call schedule() to go to sleep. So it is not waiting on a single wait channel, but potentially *multiple* wakeup sources. If you are going to prove that the kernel is not going to make forward progress, you need to prove that *all* of the ways that the process might wake up aren't going to happen for some reason.

Just because one wakeup source seems to form a circular dependency proves nothing, since another wakeup source might be the designed and architected way that the code makes forward progress. You seem to be assuming that one wakeup source is somehow the "correct" one, and the other ways that the process could be woken up are a "rescue wakeup source", and you seem to believe that relying on a "rescue wakeup source" is bad. But in the case of a process which has called prepare-to-wait on more than one wait queue, how is DEPT going to distinguish between your "morally correct" wakeup source and the "rescue wakeup source"?

> No doubt. I already think so. But it doesn't mean that I have to keep
> quiet without discussing how to improve Dept. I will keep improving
> Dept in a reasonable way.

Well, I don't want to be in a position of having to prove that every single DEPT report in my code that doesn't make sense to me is nonsense, or else DEPT will get merged.

So maybe we need to reverse the burden of proof. Instead of just sending a DEPT report, and then asking the maintainers to explain why it is a false positive, how about if *you* use the DEPT report to examine the subsystem code, and then explain, in plain English, how you think this could trigger in real life or cause a performance problem in real life, or perhaps provide a script or C reproducer that triggers the supposed deadlock?

Yes, that means you will need to understand the code in question, but hopefully the DEPT reports should be clear enough that someone who isn't a deep expert in the code should be able to spot the bug. If not, and if only a few deep experts of the code in question will be able to decipher the DEPT report and figure out a fix, that's really not ideal.

If DEPT can find a real bug and you can show that Lockdep wouldn't have been able to find it, then that would be proof that it is making a real contribution. That would be a real benefit. At the same time, DEPT will hopefully be able to demonstrate a false positive rate which is low enough that the benefits clearly outweigh the costs.

At the moment, I believe the scoreboard for DEPT with respect to ext4 is zero real bugs found, and two false positives, both of which have required significant rounds of e-mail before the subsystem maintainers were able to prove to you that it was, indeed, DEPT reporting a false positive. Do you now understand why I am so concerned that you aren't putting an RFC or NOT YET READY in the subject line?

					- Ted

P.S. If DEPT had a CONFIG_EXPERIMENTAL, with a disclaimer in the KConfig that some of its reports might be false positives, that might be another way of easing my fears about this getting used by Syzkaller and generating a lot of burdensome triage work for the maintainers.
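Ted's point above - that a sleeping task can legitimately have several simultaneous wakeup sources - is easy to make concrete. The sketch below puts one task on two wait queues with a timeout armed, so three independent events can end the same sleep; wq_a, wq_b, cond_a and cond_b are invented names for illustration:

	#include <linux/wait.h>
	#include <linux/sched.h>
	#include <linux/jiffies.h>

	/* Stand-in condition checks; real code would test whatever state
	 * the two wait queues signal. */
	static bool cond_a(void);
	static bool cond_b(void);

	static void wait_on_two_sources(wait_queue_head_t *wq_a,
					wait_queue_head_t *wq_b)
	{
		DEFINE_WAIT(wait_a);
		DEFINE_WAIT(wait_b);

		/* Subscribe to both queues before checking the conditions. */
		prepare_to_wait(wq_a, &wait_a, TASK_INTERRUPTIBLE);
		prepare_to_wait(wq_b, &wait_b, TASK_INTERRUPTIBLE);

		if (!cond_a() && !cond_b())
			schedule_timeout(HZ);	/* the timer is a third source */

		/* A wake_up() on either queue, a signal, or the timer gets
		 * us here; treating any single edge as "the" dependency
		 * would miss the other guarantees of forward progress. */
		finish_wait(wq_a, &wait_a);
		finish_wait(wq_b, &wait_b);
	}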
Ted wrote: > On Sat, Mar 05, 2022 at 11:55:34PM +0900, Byungchul Park wrote: > > > that is why some of the DEPT reports were completely incomprehensible > > > > It's because you are blinded to blame at it without understanding how > > Dept works at all. I will fix those that must be fixed. Don't worry. > > Users of DEPT must not have to understand how DEPT works in order to Users must not have to understand how Dept works for sure, and haters must not blame things based on what they guess wrong. > understand and use DEPT reports. If you think I don't understand how > DEPT work, I'm going to gently suggest that this means DEPT reports > are clear enough, and/or DEPT documentation needs to be > *substantially* improved, or both --- and these needs to happen before > DEPT is ready to be merged. Okay. > > > So if DEPT is issuing lots of reports about apparently circular > > > dependencies, please try to be open to the thought that the fault is > > > > No one was convinced that Dept doesn't have a fault. I think your > > worries are too much. > > In that case, may I ask that you add back a RFC to the subject prefix > (e.g., [PATCH RFC -v5]?) Or maybe even add the subject prefix NOT YET I will. > READY? I have seen cases when after a patch series get to PATCH -v22, > and then people assume that it *must* be ready, as opposed what it > really means, which is "the author is just persistently reposting and > rebasing the patch series over and over again". It would be helpful > if you directly acknowledge, in each patch submission, that it is not > yet ready for prime time. > > After all, right now, DEPT has generated two reports in ext4, both of > which were false positives, and both of which have required a lot of > maintainer times to prove to you that they were in fact false > positives. So are we all agreed that DEPT is not ready for prime > time? Yes. > > No one argued that their code must be buggy, either. So I don't think > > you have to worry about what's never happened. > > Well, you kept on insisting that ext4 must have a circular dependency, > and that depending on a "rescue wakeup" is bad programming practice, > but you'll reluctantly agree to make DEPT accept "rescue wakeups" if > that is the will of the developers. My concern here is the > fundmaental concept of "rescue wakeups" is wrong; I don't see how you > can distinguish between a valid wakeup and one that you and DEPT is > going to somehow characterize as dodgy. Your concern on it makes sense. I need to explain how I think about it more, but not now cuz I guess the other folks alrealy got tired enough. Let's talk about it later when needed again. > Consider: a process can first subscribe to multiple wait queues, and > arrange to be woken up by a timeout, and then call schedule() to go to > sleep. So it is not waiting on a single wait channel, but potentially > *multiple* wakeup sources. If you are going to prove that kernel is > not going to make forward progress, you need to prove that *all* ways > that process might not wake up aren't going to happen for some reason. > > Just because one wakeup source seems to form a circular dependency > proves nothing, since another wakeup source might be the designed and > architected way that code makes forward progress. I also think it's legal if the design is intended. But it's not if not. This is what I meant. Again, it's not about ext4. 
> You seem to be assuming that one wakeup source is somehow the
> "correct" one, and the other ways that the process could be woken up
> are a "rescue wakeup source", and you seem to believe that relying on
> a "rescue wakeup source" is bad. But in the case of a process which

It depends on whether or not it's intended, IMHO.

> has called prepare-to-wait on more than one wait queue, how is DEPT
> going to distinguish between your "morally correct" wakeup source and
> the "rescue wakeup source"?

Sure, it should be done manually. I should do it on my own when that
kind of issue arises. Again, ext4 is not such a case because, based on
what Jan explained, there's no real circular dependency wrt the commit
wq, the done wq and so on.

> > No doubt. I already think so. But that doesn't mean I have to keep
> > quiet rather than discussing how to improve Dept. I will keep
> > improving Dept in a reasonable way.
>
> Well, I don't want to be in a position of having to prove that every
> single DEPT report in my code that doesn't make sense to me is
> nonsense, or else DEPT will get merged.
>
> So maybe we need to reverse the burden of proof.

I will keep that in mind.

> Instead of just sending a DEPT report, and then asking the maintainers
> to explain why it is a false positive, how about if *you* use the DEPT
> report to examine the subsystem code, and then explain, in plain
> English, how you think this could trigger in real life, or cause a
> performance problem in real life, or perhaps provide a script or C
> reproducer that triggers the supposed deadlock?

Makes sense. Let me try.

> Yes, that means you will need to understand the code in question, but
> hopefully the DEPT reports should be clear enough that someone who
> isn't a deep expert in the code should be able to spot the bug. If
> not, and if only a few deep experts in the code in question will be
> able to decipher the DEPT report and figure out a fix, that's really
> not ideal.

Agreed. Just FYI, I've never blamed you for not being an expert on
Dept.

> If DEPT can find a real bug and you can show that Lockdep wouldn't
> have been able to find it, then that would be proof that it is making
> a real contribution. That would be a real benefit. At the same time,
> DEPT will hopefully be able to demonstrate a false positive rate which
> is low enough that the benefits clearly outweigh the costs.
>
> At the moment, I believe the scoreboard for DEPT with respect to ext4
> is zero real bugs found, and two false positives, both of which have
> required significant rounds of e-mail before the subsystem maintainers
> were able to prove to you that it was, indeed, DEPT reporting a false
> positive.

Right. But we've barely talked in a productive way. We've talked about
other things far more than about proving what you're mentioning.

> Do you now understand why I am so concerned that you aren't putting an
> RFC or NOT YET READY in the subject line?

Yes. I will.

> 					- Ted
>
> P.S. If DEPT had a CONFIG_EXPERIMENTAL, with a disclaimer in the
> Kconfig that some of its reports might be false positives, that might
> be another way of easing my fears that it will get used by Syzkaller
> and generate a lot of burdensome triage work for the maintainers.

Thank you for the straightforward explanation this time,

	Byungchul
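To illustrate the multi-source wait pattern being debated above, here
is a minimal sketch using the stock kernel wait-queue API. The queue
names and the cond_met() predicate are hypothetical, invented only for
illustration; the wait primitives themselves are the real ones:

    #include <linux/wait.h>
    #include <linux/sched.h>
    #include <linux/jiffies.h>

    /* Hypothetical wait queues and condition, for illustration only. */
    static DECLARE_WAIT_QUEUE_HEAD(wq_primary);
    static DECLARE_WAIT_QUEUE_HEAD(wq_rescue);

    static bool cond_met(void)
    {
            return false;   /* placeholder predicate */
    }

    /* Simplified: a real caller would loop until cond_met() holds. */
    static void wait_with_multiple_sources(void)
    {
            DEFINE_WAIT(w1);
            DEFINE_WAIT(w2);

            /* Subscribe to two wait queues; a wake_up() on either
             * queue resumes this task.
             */
            prepare_to_wait(&wq_primary, &w1, TASK_UNINTERRUPTIBLE);
            prepare_to_wait(&wq_rescue, &w2, TASK_UNINTERRUPTIBLE);

            /* The timeout is a third, independent wakeup source. */
            if (!cond_met())
                    schedule_timeout(HZ);

            finish_wait(&wq_primary, &w1);
            finish_wait(&wq_rescue, &w2);
    }

An analysis that reasons only about the wq_primary dependency could
conclude the sleeper is stuck, even though wq_rescue or the timeout
may be the architected path to forward progress - which is exactly the
ambiguity the thread is arguing over.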
On Sun, Mar 06, 2022 at 07:51:42PM +0900, Byungchul Park wrote:
> > > Users of DEPT must not have to understand how DEPT works in order to
> >
> Users must not have to understand how Dept works, for sure, and haters
> must not blame things based on what they guess wrong.

For the record, I don't hate DEPT. I *fear* that DEPT will result in
my getting spammed with a huge number of false positives once it gets
merged and automated testing systems like Syzkaller, the zero-day test
robot, etc., get hold of it and start generating hundreds of automated
reports.

And when I tried to read the DEPT reports and the DEPT documentation,
I found that its explanation for why ext4 had a circular dependency
simply did not make sense. If my struggles to understand why DEPT was
issuing a false positive are "guessing", then how do we have
discussions over how to make DEPT better?

> > called prepare-to-wait on more than one wait queue, how is DEPT going
> > to distinguish between your "morally correct" wakeup source and the
> > "rescue wakeup source"?
>
> Sure, it should be done manually. I should do it on my own when that
> kind of issue arises.

The question here is how often will it need to be done, and how easy
will it be to "do it manually"? Suppose we mark all of the DEPT false
positives before it gets merged; how easy will it be to suppress
future false positives as the kernel evolves?

Perhaps one method is to have a way to take a particular wait queue,
or call to schedule(), or even an entire kernel source file, and opt
it out from DEPT analysis? That way, if DEPT gets merged, and a
maintainer starts getting spammed by bogus (or incomprehensible)
reports, there is a simple way they can annotate their source code to
prevent DEPT from analyzing code that it is apparently not able to
understand correctly.

That way we don't necessarily need to have a debate over how close to
zero percent false positives is necessary before DEPT can get merged.
And we avoid needing to force maintainers to prove that a DEPT report
is a false positive, which is from my experience hard to do, since
they get accused of being DEPT haters and of not understanding DEPT.

					- Ted
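For what it's worth, a sketch of what such an opt-out could look like.
None of these annotations exist in the posted series; every name below
is purely hypothetical, shown only to make the proposal concrete.
Lockdep's existing lockdep_set_novalidate_class() is precedent for the
per-object form:

    /* All names here are hypothetical -- not part of the posted DEPT
     * series -- shown only to make the opt-out proposal concrete.
     */

    /* Per-object: exclude one wait channel from DEPT analysis. */
    dept_skip(&my_waitqueue.dmap);          /* hypothetical API and field */

    /* Per-call-site: mark one sleep as exempt from analysis. */
    dept_exempt_begin();                    /* hypothetical */
    schedule();
    dept_exempt_end();                      /* hypothetical */

    /*
     * Per-file: mirror how KASAN/KCSAN are switched off per object
     * file in a Makefile, e.g.:
     *
     *      DEPT_SANITIZE_foo.o := n        (hypothetical knob)
     */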
On Sat, Mar 05, 2022 at 03:05:23PM +0000, Joel Fernandes wrote:
> On Sat, Mar 05, 2022 at 11:15:38PM +0900, Byungchul Park wrote:
> > Almost everything you've been blaming Dept for is total nonsense.
> > Based on what you're saying, I'm convinced that you don't understand
> > how Dept works even 1%. You don't even try to understand it before
> > blaming it.
> >
> > You don't have to understand and support it. But I can't respond to
> > you if you keep saying silly things that way.
>
> Byungchul, other than ext4 have there been any DEPT reports that other
> subsystem maintainers agree were valid use cases?

Not yet.

> Regarding false positives, just to note, lockdep is not without its
> share of false positives. Just that (as you know) the signal-to-noise
> ratio should be high for it to be useful. I've put up with lockdep's
> false positives just because it occasionally saves me from
> catastrophe.

I love your insight. Agreed. A tool is useful only when it's
*actually* helpful. I hope Dept will be so.

> > > In any case, if DEPT is going to report these "circular dependencies
> > > as bugs that MUST be fixed", it's going to be pure noise and I will
> > > ignore all DEPT reports, and will push back on having Lockdep replaced
> >
> > Dept is going to be improved so that what you are concerned about
> > won't be reported.
>
> Yeah, I am looking forward to learning more about it. However, I was
> wondering about the following: lockdep can already be used for
> modeling "resource acquire/release" and "resource wait" semantics that
> are unrelated to locks, like we do in mm reclaim. I am wondering why
> we cannot just use those existing lockdep mechanisms for the wait/wake
> use cases (assuming that we can agree

1. Lockdep basically can't work with general waits/events happening
   across contexts. To get over this, manual tagging of
   acquire/release can be used at each section that we suspect (see
   the sketch after this list). But unfortunately, we cannot use that
   method if we cannot easily identify the sections. Furthermore, it's
   inevitable that we miss sections that shouldn't get missed.

2. Some cases should be correctly tracked via a wait/event model, not
   an acquisition-order model. For example, a read-lock on an rwlock
   should be defined as a waiter waiting for write-unlock, and a
   write-lock on an rwlock as a waiter waiting for either read-unlock
   or write-unlock. If we instead try to track those cases using
   acquisition order, it cannot completely work. Don't you think that
   looks weird?

3. Tracking what we didn't track before means both stronger detection
   and the emergence of new false positives, exactly as with Lockdep
   at its beginning, when it started to track what we hadn't tracked
   before. Even though that emergence was tolerated at the time, now
   that Lockdep has become stable enough, folks will be stricter about
   new emergences. It gets even worse if valid reports are being
   prevented by false positives. For that reason, a multi-reporting
   capability is essential. I was thinking of improving Lockdep to
   allow multi-reporting, but that might require changing more than
   developing a new tool from scratch, and it might be even harder
   because Lockdep already works reasonably well. So even for
   Lockdep's sake, I thought the new thing should be developed
   independently, leaving Lockdep as it is.

4. (Minor reason.) The concept and naming of acquisition and release
   are not meant for general wait/event, and neither are the design
   and implementation. I wanted to address the issue as soon as
   possible, before we squeeze Lockdep even further for general
   wait/event use and the kernel code gets weird.
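For reference, the existing mechanism point 1 alludes to is lockdep's
acquire/release annotations applied to a non-lock resource, as mm's
fs_reclaim tracking does. A minimal sketch following that pattern; the
map and function names here are illustrative:

    #include <linux/lockdep.h>

    /* A lockdep map standing in for a resource that is not a lock;
     * the name is illustrative. The pattern follows __fs_reclaim_map
     * in mm/page_alloc.c.
     */
    static struct lockdep_map my_resource_map =
            STATIC_LOCKDEP_MAP_INIT("my_resource", &my_resource_map);

    static void my_resource_wait_begin(void)
    {
            /* Pretend the resource was "acquired" so that lockdep
             * orders it against real locks taken inside the section.
             */
            lock_map_acquire(&my_resource_map);
    }

    static void my_resource_wait_end(void)
    {
            lock_map_release(&my_resource_map);
    }

Note that this only works when the begin/end of the suspect section
can be identified by hand - exactly the limitation point 1 describes;
waits and wakeups that cross contexts have no bracketed section to tag.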
Of course, that doesn't mean Dept is more stable than Lockdep. But I
can tell that Dept does what a dependency tool should do, and we need
to make the code go right.

> that circular dependencies related to wait/wake are a bad thing). Or
> perhaps there's a reason why Peter Zijlstra did not use lockdep for
> wait/wake dependencies (such as multiple wake sources), considering
> he wrote a lot of that code.
>
> Keep kicking ass brother, you're doing great.

Thank you! I'll see this through properly so as not to disappoint you!

	Thanks,
	Byungchul
On Sun, Mar 06, 2022 at 09:19:10AM -0500, Theodore Ts'o wrote:
> On Sun, Mar 06, 2022 at 07:51:42PM +0900, Byungchul Park wrote:
> > > > Users of DEPT must not have to understand how DEPT works in order to
> > >
> > Users must not have to understand how Dept works, for sure, and
> > haters must not blame things based on what they guess wrong.
>
> For the record, I don't hate DEPT. I *fear* that DEPT will result in
> my getting spammed with a huge number of false positives once it gets
> merged and automated testing systems like Syzkaller, the zero-day
> test robot, etc., get hold of it and start generating hundreds of
> automated reports.

Agreed. Dept should not be part of an *automated testing system* until
it finally works as well as Lockdep in terms of false positives.
However, it's impossible to achieve that out of tree.

Besides automated testing systems, Dept works great in the middle of
developing something that is complicated in terms of synchronization.
Developers wouldn't have to worry anymore about real reports - ones
that should be reported - getting drowned out by a false positive.

I will explicitly mark it EXPERIMENTAL and state "Dept might
false-alarm" in the Kconfig until it's considered a rarely
false-alarming tool.

> > Sure, it should be done manually. I should do it on my own when that
> > kind of issue arises.
>
> The question here is how often will it need to be done, and how easy

I guess it's going to happen rarely. I want to see, too.

> will it be to "do it manually"? Suppose we mark all of the DEPT false

Very easy - equal to or easier than the way we do it for lockdep. But
the objects of interest would be wait/event objects rather than locks.

> positives before it gets merged; how easy will it be to suppress
> future false positives as the kernel evolves?

The same as - or even better than - what we do for Lockdep. And we'd
better consider those activities a form of code documentation. Not
only making things just work, but also organizing the code and
documenting it in the code, is very meaningful.

> Perhaps one method is to have a way to take a particular wait queue,
> or call to schedule(), or even an entire kernel source file, and opt
> it out from DEPT analysis? That way, if DEPT gets merged, and a
> maintainer starts getting spammed by bogus (or

That could be done once Dept gets stable. Hopefully, now that Dept
works very conservatively, there might not be as many false positives
as you're concerned about. The situation is under control.

> That way we don't necessarily need to have a debate over how close to
> zero percent false positives is necessary before DEPT can get merged.

Nonsense. I would agree with you if it had been that way when Lockdep
was merged. But I'll try to achieve almost zero false positives -
again, that's impossible to do out of tree.

> And we avoid needing to force maintainers to prove that a DEPT report

So... it'd be okay if Dept goes in without being part of an automated
testing system. Right?

> is a false positive, which is from my experience hard to do, since
> they get accused of being DEPT haters and of not understanding DEPT.

Honestly, it's not a problem of them not understanding domains other
than what they are familiar with, but another issue. I won't mention
it.

And it sounds like you'd do nothing unless something turns out to be
100% problematic. That's definitely the *easiest* way to maintain
something, because it's the same as not checking its correctness at
all. Even if it's very hard to do, repeatedly checking whether the
code really is safe is what surely should be done.
Again, I understand it would be freaking hard. But that doesn't mean
we should avoid it.

Here, there seem to be two points you'd like to make:

1. A fundamental question: does Dept track waits and events correctly?
2. Even if so, can Dept handle all the subtle things in the kernel?

For 1, I'm willing to respond to whatever it is. And not only me - we
can make it work perfectly if the concept and direction are *right*.

For 2, I need to ask questions and try my best to fix whatever issues
exist.

Again: Dept won't be part of an *automated testing system* until it
finally works as well as Lockdep in terms of false positives.
Hopefully you are okay with that.

---
Byungchul
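As a point of comparison for "the way we do it for lockdep" mentioned
above, these are the real mechanisms used today to quiet a lockdep
false positive; the surrounding structure and variable names are
illustrative, the annotations themselves are lockdep's actual API:

    #include <linux/lockdep.h>
    #include <linux/rwsem.h>

    /* Illustrative key name; one static key per distinct class. */
    static struct lock_class_key my_key;

    static void reclassify(struct rw_semaphore *sem)
    {
            /* 1) Move the lock into its own class so it no longer
             *    shares a (falsely) conflicting class with another
             *    lock.
             */
            lockdep_set_class(sem, &my_key);
    }

    static void nested(struct rw_semaphore *parent,
                       struct rw_semaphore *child)
    {
            /* 2) Annotate a legitimate nested acquisition of two
             *    locks of the same class, e.g. parent/child objects.
             */
            down_write(parent);
            down_write_nested(child, SINGLE_DEPTH_NESTING);
            up_write(child);
            up_write(parent);
    }

A last resort also exists: lockdep_set_novalidate_class() opts a lock
out of validation entirely. The open question in this thread is
whether Dept's wait/event objects will admit an equally small
annotation surface.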