[RFC,v6,00/21] DEPT(Dependency Tracker)

Message ID 1651652269-15342-1-git-send-email-byungchul.park@lge.com (mailing list archive)
Series DEPT(Dependency Tracker)

Message

Byungchul Park May 4, 2022, 8:17 a.m. UTC
Hi Linus and folks,

I've been developing a tool for detecting deadlock possibilities by
tracking wait/event rather than lock acquisition order, to try to
cover all synchronization mechanisms. It's based on the v5.18-rc3 tag.

https://github.com/lgebyungchulpark/linux-dept/commits/dept1.20_on_v5.18-rc3

Benefits:

	0. Works with all lock primitives.
	1. Works with wait_for_completion()/complete().
	2. Works with 'wait' on PG_locked.
	3. Works with 'wait' on PG_writeback.
	4. Works with swait/wakeup.
	5. Works with waitqueue.
	6. Multiple reports are allowed.
	7. Deduplication control on multiple reports.
	8. Withstands false positives thanks to 6.
	9. Easy to tag any wait/event.

Future work:

	0. To make it more stable.
	1. To separate Dept from Lockdep.
	2. To improve performance in terms of time and space.
	3. To use Dept as a dependency engine for Lockdep.
	4. To add any missing tags of wait/event in the kernel.
	5. To deduplicate stack trace.

How to interpret reports:

	1. The E(event) in each context cannot be triggered because of
	   the W(wait) that cannot be woken up.
	2. The stack trace that helps find the problematic code is
	   located in each context's detail (see the example below).
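
	For example, the summary part of a report looks like the
	following. This excerpt is taken from an actual Dept report
	that appears later in this thread:

	   context A
	       [S] down_write(&ei->i_data_sem:0)
	       [W] folio_wait_bit_common(pglocked:0)
	       [E] up_write(&ei->i_data_sem:0)

	   [S]: start of the event context
	   [W]: the wait blocked
	   [E]: the event not reachable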

Thanks,
Byungchul

---

Changes from v5:

	1. Use just pr_warn_once() rather than WARN_ONCE() on the lack
	   of internal resources, because WARN_*() printing a stack
	   trace is too much for merely reporting the shortage.
	   (feedback from Ted, Hyeonggon)
	2. Fix trivial bugs, like a struct not being initialized before
	   being used.
	3. Assign a different class per task when handling onstack
	   variables for waitqueue or the like, which makes Dept
	   distinguish between onstack variables of different tasks so
	   as to prevent false positives. (reported by Hyeonggon)
	4. Make Dept aware of even raw_local_irq_*() to prevent false
	   positives. (reported by Hyeonggon)
	5. Don't consider dependencies between the events that might be
	   triggered within __schedule() and the waits that require
	   __schedule() as real ones. (reported by Hyeonggon)
	6. Unstage a staged wait that has passed prepare_to_wait_event()
	   but has yet to reach __schedule(), if we encounter
	   __schedule() in between for another sleep, which is possible
	   if e.g. a mutex_lock() exists in the 'condition' of
	   ___wait_event().
	7. Turn on CONFIG_PROVE_LOCKING when CONFIG_DEPT is on, to rely
	   on the hardirq and softirq entrance tracing to make Dept more
	   portable for now.

Changes from v4:

	1. Fix some bugs that produce false alarms.
	2. Distinguish each syscall context from another *for arm64*.
	3. Make it just print a message rather than warn in case the
	   Dept ring buffer gets exhausted. (feedback from Hyeonggon)
	4. Explicitly describe "EXPERIMENTAL" and "Dept might produce
	   false positive reports" in Kconfig. (feedback from Ted)

Changes from v3:

	1. Dept shouldn't create dependencies between different depths
	   of a class that were indicated by *_lock_nested(). Dept
	   normally doesn't, but it did once another lock class came in
	   between. So fixed it. (feedback from Hyeonggon)
	2. Dept considered a wait a real wait once it got to
	   __schedule(), even if the task had already been set to
	   TASK_RUNNING by a wake-up source. Fixed it so that Dept
	   doesn't consider that case a real wait. (feedback from Jan
	   Kara)
	3. Stop tracking dependencies with a map once the event
	   associated with the map has been handled. Dept will start to
	   work with the map again, on the next sleep.

Changes from v2:

	1. Disable Dept on bit_wait_table[] in sched/wait_bit.c, which
	   was reporting a lot of false positives; my fault. The
	   wait/event for bit_wait_table[] should've been tagged in a
	   higher layer to work better, which is future work.
	   (feedback from Jan Kara)
	2. Disable Dept on crypto_larval's completion to prevent a false
	   positive.

Changes from v1:

	1. Fix coding style and typos. (feedback from Steven)
	2. Distinguish each work context from another in workqueue.
	3. Skip checking lock acquisition with nest_lock, which is about
	   correct lock usage that should be checked by Lockdep.

Changes from RFC(v0):

	1. Add the wait tag at __schedule() rather than at
	   prepare_to_wait(). (feedback from Linus and Matthew)
	2. Use try version at lockdep_acquire_cpus_lock() annotation.
	3. Distinguish each syscall context from another.

Byungchul Park (21):
  llist: Move llist_{head,node} definition to types.h
  dept: Implement Dept(Dependency Tracker)
  dept: Apply Dept to spinlock
  dept: Apply Dept to mutex families
  dept: Apply Dept to rwlock
  dept: Apply Dept to wait_for_completion()/complete()
  dept: Apply Dept to seqlock
  dept: Apply Dept to rwsem
  dept: Add proc knobs to show stats and dependency graph
  dept: Introduce split map concept and new APIs for them
  dept: Apply Dept to wait/event of PG_{locked,writeback}
  dept: Apply SDT to swait
  dept: Apply SDT to wait(waitqueue)
  locking/lockdep, cpu/hotplug: Use a weaker annotation in AP thread
  dept: Distinguish each syscall context from another
  dept: Distinguish each work from another
  dept: Disable Dept within the wait_bit layer by default
  dept: Disable Dept on struct crypto_larval's completion for now
  dept: Differentiate onstack maps from others of different tasks in
    class
  dept: Do not add dependencies between events within scheduler and
    sleeps
  dept: Unstage wait when tagging a normal sleep wait

 arch/arm64/kernel/syscall.c        |    2 +
 arch/x86/entry/common.c            |    4 +
 crypto/api.c                       |    7 +-
 include/linux/completion.h         |   44 +-
 include/linux/dept.h               |  596 ++++++++
 include/linux/dept_page.h          |   78 +
 include/linux/dept_sdt.h           |   67 +
 include/linux/hardirq.h            |    3 +
 include/linux/irqflags.h           |   71 +-
 include/linux/llist.h              |    8 -
 include/linux/lockdep.h            |  186 ++-
 include/linux/lockdep_types.h      |    3 +
 include/linux/mutex.h              |   22 +
 include/linux/page-flags.h         |   45 +-
 include/linux/pagemap.h            |    7 +-
 include/linux/percpu-rwsem.h       |    4 +-
 include/linux/rtmutex.h            |    1 +
 include/linux/rwlock.h             |   42 +
 include/linux/rwlock_api_smp.h     |    8 +-
 include/linux/rwlock_types.h       |    1 +
 include/linux/rwsem.h              |   22 +
 include/linux/sched.h              |    7 +
 include/linux/seqlock.h            |   60 +-
 include/linux/spinlock.h           |   21 +
 include/linux/spinlock_types_raw.h |    3 +
 include/linux/swait.h              |    4 +
 include/linux/types.h              |    8 +
 include/linux/wait.h               |    6 +-
 init/init_task.c                   |    2 +
 init/main.c                        |    4 +
 kernel/Makefile                    |    1 +
 kernel/cpu.c                       |    2 +-
 kernel/dependency/Makefile         |    4 +
 kernel/dependency/dept.c           | 2938 ++++++++++++++++++++++++++++++++++++
 kernel/dependency/dept_hash.h      |   10 +
 kernel/dependency/dept_internal.h  |   26 +
 kernel/dependency/dept_object.h    |   13 +
 kernel/dependency/dept_proc.c      |   92 ++
 kernel/exit.c                      |    7 +
 kernel/fork.c                      |    2 +
 kernel/locking/lockdep.c           |   28 +-
 kernel/locking/spinlock_rt.c       |   24 +-
 kernel/module.c                    |    2 +
 kernel/sched/completion.c          |   12 +-
 kernel/sched/core.c                |   10 +
 kernel/sched/swait.c               |   10 +
 kernel/sched/wait.c                |   16 +
 kernel/sched/wait_bit.c            |    5 +-
 kernel/workqueue.c                 |    3 +
 lib/Kconfig.debug                  |   28 +
 mm/filemap.c                       |   68 +
 mm/page_ext.c                      |    5 +
 52 files changed, 4558 insertions(+), 84 deletions(-)
 create mode 100644 include/linux/dept.h
 create mode 100644 include/linux/dept_page.h
 create mode 100644 include/linux/dept_sdt.h
 create mode 100644 kernel/dependency/Makefile
 create mode 100644 kernel/dependency/dept.c
 create mode 100644 kernel/dependency/dept_hash.h
 create mode 100644 kernel/dependency/dept_internal.h
 create mode 100644 kernel/dependency/dept_object.h
 create mode 100644 kernel/dependency/dept_proc.c

Comments

Linus Torvalds May 4, 2022, 6:17 p.m. UTC | #1
On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
>
> Hi Linus and folks,
>
> I've been developing a tool for detecting deadlock possibilities by
> tracking wait/event rather than lock acquisition order, to try to
> cover all synchronization mechanisms.

So what is the actual status of reports these days?

Last time I looked at some reports, it gave a lot of false positives
due to mis-understanding prepare_to_sleep().

For this all to make sense, it would need to not have false positives
(or at least a very small number of them together with a way to sanely
get rid of them), and also have a track record of finding things that
lockdep doesn't.

Maybe such reports have been sent out with the current situation, and
I haven't seen them.

                 Linus
Byungchul Park May 6, 2022, 12:11 a.m. UTC | #2
Linus wrote:
>
> On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> >
> > Hi Linus and folks,
> >
> > I've been developing a tool for detecting deadlock possibilities by
> > tracking wait/event rather than lock acquisition order, to try to
> > cover all synchronization mechanisms.
> 
> So what is the actual status of reports these days?
> 
> Last time I looked at some reports, it gave a lot of false positives
> due to mis-understanding prepare_to_sleep().

Yes, it did. I handled the case in the following way:

1. Stage the wait at prepare_to_sleep(), which might be committed
   later. The staged wait is not yet an actual wait that Dept
   considers.
2. If the condition for sleep is true, the wait gets committed at
   __schedule() and becomes an actual wait that Dept considers.
3. If the condition is false and the task gets back to TASK_RUNNING,
   clean (=reset) the staged wait.

That way, Dept only considers the waits through sleep that actually
reach __schedule().
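
As a rough illustration only, the stage/commit/unstage flow described
above might be sketched in plain C as below. Every name here is made
up for the sketch; it is not the actual DEPT implementation.

   enum wait_state { WAIT_NONE, WAIT_STAGED, WAIT_COMMITTED };

   struct task_wait { enum wait_state state; };

   /* At prepare_to_sleep(): staged, not yet a real wait for Dept. */
   static void stage_wait(struct task_wait *w)
   {
           w->state = WAIT_STAGED;
   }

   /* At __schedule(): the staged wait becomes an actual wait. */
   static void commit_wait(struct task_wait *w)
   {
           if (w->state == WAIT_STAGED)
                   w->state = WAIT_COMMITTED; /* now used for dependencies */
   }

   /* Back to TASK_RUNNING without sleeping: reset the staged wait. */
   static void unstage_wait(struct task_wait *w)
   {
           if (w->state == WAIT_STAGED)
                   w->state = WAIT_NONE; /* never reached __schedule() */
   }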

> For this all to make sense, it would need to not have false positives
> (or at least a very small number of them together with a way to sanely

Yes. I agree with you. I got rid of them in the way I described above.

> get rid of them), and also have a track record of finding things that
> lockdep doesn't.

I have some reports in which wait_for_completion() or a waitqueue is
involved. It's worth noting those are not tracked by Lockdep. I'm
checking whether they are true positives. I will share those reports
once I'm more convinced.

> Maybe such reports have been sent out with the current situation, and
> I haven't seen them.

Dept reports have usually been sent to me privately, not to LKML. As
I told you, I'm planning to share them.

	Byungchul

> 
>                  Linus
>
Hyeonggon Yoo May 7, 2022, 7:20 a.m. UTC | #3
On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> Linus wrote:
> >
> > On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> > >
> > > Hi Linus and folks,
> > >
> > > I've been developing a tool for detecting deadlock possibilities by
> > > tracking wait/event rather than lock acquisition order, to try to
> > > cover all synchronization mechanisms.
> > 
> > So what is the actual status of reports these days?
> > 
> > Last time I looked at some reports, it gave a lot of false positives
> > due to mis-understanding prepare_to_sleep().
> 
> Yes, it did. I handled the case in the following way:
> 
> 1. Stage the wait at prepare_to_sleep(), which might be committed
>    later. The staged wait is not yet an actual wait that Dept
>    considers.
> 2. If the condition for sleep is true, the wait gets committed at
>    __schedule() and becomes an actual wait that Dept considers.
> 3. If the condition is false and the task gets back to TASK_RUNNING,
>    clean (=reset) the staged wait.
> 
> That way, Dept only considers the waits through sleep that actually
> reach __schedule().
> 
> > For this all to make sense, it would need to not have false positives
> > (or at least a very small number of them together with a way to sanely
> 
> Yes. I agree with you. I got rid of them in the way I described above.
>

IMHO DEPT should not report what lockdep allows (Not talking about
wait events). I mean lockdep allows some kind of nested locks but
DEPT reports them.

When I was collecting reports from DEPT on various configurations,
most of them were reports of down_write_nested(), which is allowed in
lockdep.

DEPT should not report at least what we know is not a real deadlock.
Otherwise there will be reports that are never fixed, which is quite
unpleasant, and reporters cannot examine all of them to tell whether
they are real deadlocks or not.

> > get rid of them), and also have a track record of finding things that
> > lockdep doesn't.
> 
> I have some reports in which wait_for_completion() or a waitqueue is
> involved. It's worth noting those are not tracked by Lockdep. I'm
> checking whether they are true positives. I will share those reports
> once I'm more convinced.
> 
> > Maybe such reports have been sent out with the current situation, and
> > I haven't seen them.
> 
> Dept reports have usually been sent to me privately, not to LKML. As
> I told you, I'm planning to share them.
> 
> 	Byungchul
> 
> > 
> >                  Linus
> >
Byungchul Park May 9, 2022, 12:16 a.m. UTC | #4
On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> > Linus wrote:
> > >
> > > On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> > > >
> > > > Hi Linus and folks,
> > > >
> > > > I've been developing a tool for detecting deadlock possibilities by
> > > > tracking wait/event rather than lock acquisition order, to try to
> > > > cover all synchronization mechanisms.
> > > 
> > > So what is the actual status of reports these days?
> > > 
> > > Last time I looked at some reports, it gave a lot of false positives
> > > due to mis-understanding prepare_to_sleep().
> > 
> > Yes, it did. I handled the case in the following way:
> > 
> > 1. Stage the wait at prepare_to_sleep(), which might be committed
> >    later. The staged wait is not yet an actual wait that Dept
> >    considers.
> > 2. If the condition for sleep is true, the wait gets committed at
> >    __schedule() and becomes an actual wait that Dept considers.
> > 3. If the condition is false and the task gets back to TASK_RUNNING,
> >    clean (=reset) the staged wait.
> > 
> > That way, Dept only considers the waits through sleep that actually
> > reach __schedule().
> > 
> > > For this all to make sense, it would need to not have false positives
> > > (or at least a very small number of them together with a way to sanely
> > 
> > Yes. I agree with you. I got rid of them in the way I described above.
> >
> 
> IMHO DEPT should not report what lockdep allows (Not talking about

No.

> wait events). I mean lockdep allows some kind of nested locks but
> DEPT reports them.

You have already asked exactly the same question in another LKML
thread. I answered it that time, but let me explain it again.

---

CASE 1.

   lock L with depth n
   lock_nested L' with depth n + 1
   ...
   unlock L'
   unlock L

This case is allowed by Lockdep.
This case is allowed by DEPT cuz it's not a deadlock.

CASE 2.

   lock L with depth n
   lock A
   lock_nested L' with depth n + 1
   ...
   unlock L'
   unlock A
   unlock L

This case is allowed by Lockdep.
This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.

---

The following scenario would explain why CASE 2 is problematic.

   THREAD X			THREAD Y

   lock L with depth n
				lock L' with depth n
   lock A
				lock A
   lock_nested L' with depth n + 1
				lock_nested L'' with depth n + 1
   ...				...
   unlock L'			unlock L''
   unlock A			unlock A
   unlock L			unlock L'
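
To make the scenario concrete, here is a hypothetical kernel-style
sketch of CASE 2. The node structure and the function are made up for
illustration; mutex_lock_nested() is the real annotation in question.

   struct node {
           struct mutex lock;              /* every instance: class L */
   };

   static DEFINE_MUTEX(a);                 /* class A */

   /*
    * THREAD X runs this on (parent, child); THREAD Y runs the same
    * code one level down, e.g. on (child, grandchild).
    */
   static void walk(struct node *outer, struct node *inner)
   {
           mutex_lock(&outer->lock);       /* L with depth n */
           mutex_lock(&a);                 /* A */
           /* X can block here on the lock Y took as its outer lock. */
           mutex_lock_nested(&inner->lock, 1); /* L' with depth n + 1 */

           mutex_unlock(&inner->lock);
           mutex_unlock(&a);
           mutex_unlock(&outer->lock);
   }

With X holding 'a' while waiting for child->lock, and Y holding
child->lock while waiting for 'a', the circle closes. Yet both threads
acquire the classes in the order L -> A -> L(nested), so Lockdep,
which treats each depth as a separate class, stays silent.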

Yes. I need to check whether the report you shared with me is a true
one, but either way, it's not because DEPT doesn't work with
*_nested() APIs.

	Byungchul
Byungchul Park May 9, 2022, 1:22 a.m. UTC | #5
On Wed, May 04, 2022 at 11:17:02AM -0700, Linus Torvalds wrote:
> On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> >
> > Hi Linus and folks,
> >
> > I've been developing a tool for detecting deadlock possibilities by
> > tracking wait/event rather than lock acquisition order, to try to
> > cover all synchronization mechanisms.
> 
> So what is the actual status of reports these days?

I'd like to mention one important thing here. The reporting would get
stronger as more wait-event pairs get tagged everywhere DEPT can work.

Anything, e.g. a HW-SW interface, any retry logic and so on, can be a
wait-event pair if it involves a wait or an event anyway. For example,
polling on an IO-mapped read register and initiating the HW to produce
the event can also be a pair. Definitely, those make DEPT more useful.

---

The way to use the APIs:

1. Define SDT(Simple Dependency Tracker)

   DEFINE_DEPT_SDT(my_hw_event); <- add this

2. Tag the waits

   sdt_wait(&my_hw_event); <- add this
   ... retry logic until my hw work done ... <- the original code

3. Tag the events

   sdt_event(&my_hw_event); <- add this
   run_my_hw(); <- the original code

---

That's all we need to do. I believe DEPT would be a very useful tool
once all wait-event pairs get tagged by developers in all subsystems
and device drivers.
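
Putting the three steps together, a hypothetical polling driver could
look like the sketch below. Only DEFINE_DEPT_SDT(), sdt_wait() and
sdt_event() come from this series; every other name is made up.

   DEFINE_DEPT_SDT(my_hw_event);                /* 1. define the SDT */

   static void wait_for_my_hw(void)
   {
           sdt_wait(&my_hw_event);              /* 2. tag the wait */
           while (!(readl(my_hw_status_reg) & MY_HW_DONE))
                   cpu_relax();                 /* original retry logic */
   }

   static void kick_my_hw(void)
   {
           sdt_event(&my_hw_event);             /* 3. tag the event */
           writel(MY_HW_GO, my_hw_ctrl_reg);    /* initiate the HW work */
   }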

	Byungchul

> Last time I looked at some reports, it gave a lot of false positives
> due to mis-understanding prepare_to_sleep().
> 
> For this all to make sense, it would need to not have false positives
> (or at least a very small number of them together with a way to sanely
> get rid of them), and also have a track record of finding things that
> lockdep doesn't.
> 
> Maybe such reports have been sent out with the current situation, and
> I haven't seen them.
> 
>                  Linus
Steven Rostedt May 9, 2022, 8:47 p.m. UTC | #6
On Mon, 9 May 2022 09:16:37 +0900
Byungchul Park <byungchul.park@lge.com> wrote:

> CASE 2.
> 
>    lock L with depth n
>    lock A
>    lock_nested L' with depth n + 1
>    ...
>    unlock L'
>    unlock A
>    unlock L
> 
> This case is allowed by Lockdep.
> This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
> 
> ---
> 
> The following scenario would explain why CASE 2 is problematic.
> 
>    THREAD X			THREAD Y
> 
>    lock L with depth n
> 				lock L' with depth n
>    lock A
> 				lock A
>    lock_nested L' with depth n + 1

I'm confused by what exactly you are saying is a deadlock above.

Are you saying that lock A and L' are inverted? If so, lockdep had
better detect that regardless of L. A nested lock associates the
nesting with the same type of lock. That is, in lockdep, nested tells
lockdep not to trigger on the L and L', but it will not ignore that A
was taken.

-- Steve



> 				lock_nested L'' with depth n + 1
>    ...				...
>    unlock L'			unlock L''
>    unlock A			unlock A
>    unlock L			unlock L'
Theodore Ts'o May 9, 2022, 9:05 p.m. UTC | #7
I tried DEPT-v6 applied against 5.18-rc5, and it reported the
following false positive.

The reason why it's nonsense is that in context A's [W] wait:

[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [<ffffffff81259944>] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928]       folio_wait_bit_common+0x2fa/0x460
[ 1538.546248]       __filemap_get_folio+0x3e4/0x420
[ 1538.546558]       pagecache_get_page+0x11/0x40
[ 1538.546852]       ext4_mb_init_group+0x80/0x2e0
[ 1538.547152]       ext4_mb_good_group_nolock+0x2a3/0x2d0

... we're reading the block allocation bitmap into the page cache.
This does not correspond to a real inode, and so we don't actually
take ei->i_data_sem on the pseudo-inode used here.

In contrast, in context B's [W] and [E] stack traces, the
folio_wait_bit is clearly associated with a page which is mapped to a
real inode:

[ 1538.553656] [W] down_write(&ei->i_data_sem:0):
[ 1538.553948] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.554320] stacktrace:
[ 1538.554485]       ext4_map_blocks+0x17b/0x680
[ 1538.554772]       mpage_map_and_submit_extent+0xef/0x530
[ 1538.555122]       ext4_writepages+0x798/0x990
[ 1538.555409]       do_writepages+0xcf/0x1c0
[ 1538.555682]       __writeback_single_inode+0x58/0x3f0
[ 1538.556014]       writeback_sb_inodes+0x210/0x540
  		     ...

[ 1538.558621] [E] folio_wake_bit(pglocked:0):
[ 1538.558896] [<ffffffff814418c0>] ext4_bio_write_page+0x400/0x560
[ 1538.559290] stacktrace:
[ 1538.559455]       ext4_bio_write_page+0x400/0x560
[ 1538.559765]       mpage_submit_page+0x5c/0x80
[ 1538.560051]       mpage_map_and_submit_buffers+0x15a/0x250
[ 1538.560409]       mpage_map_and_submit_extent+0x134/0x530
[ 1538.560764]       ext4_writepages+0x798/0x990
[ 1538.561057]       do_writepages+0xcf/0x1c0
[ 1538.561329]       __writeback_single_inode+0x58/0x3f0
		...


In any case, this will ***never*** deadlock, and it's due to DEPT
fundamentally not understanding that waits on different pages may be
for pages that belong to completely different inodes, and so there is
zero chance this would ever deadlock.

I suspect there will be similar false positives for tests (or
userspace) that use the copy_file_range(2) or sendfile(2) system calls.

I've included the full DEPT log report below.

						- Ted

generic/011		[20:11:16][ 1533.411773] run fstests generic/011 at 2022-05-07 20:11:16
[ 1533.509603] DEPT_INFO_ONCE: Need to expand the ring buffer.
[ 1536.910044] DEPT_INFO_ONCE: Pool(wait) is empty.
[ 1538.533315] ===================================================
[ 1538.533793] DEPT: Circular dependency has been detected.
[ 1538.534199] 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571 Not tainted
[ 1538.534645] ---------------------------------------------------
[ 1538.535035] summary
[ 1538.535177] ---------------------------------------------------
[ 1538.535567] *** DEADLOCK ***
[ 1538.535567] 
[ 1538.535854] context A
[ 1538.536008]     [S] down_write(&ei->i_data_sem:0)
[ 1538.536323]     [W] folio_wait_bit_common(pglocked:0)
[ 1538.536655]     [E] up_write(&ei->i_data_sem:0)
[ 1538.536958] 
[ 1538.537063] context B
[ 1538.537216]     [S] (unknown)(pglocked:0)
[ 1538.537480]     [W] down_write(&ei->i_data_sem:0)
[ 1538.537789]     [E] folio_wake_bit(pglocked:0)
[ 1538.538082] 
[ 1538.538184] [S]: start of the event context
[ 1538.538460] [W]: the wait blocked
[ 1538.538680] [E]: the event not reachable
[ 1538.538939] ---------------------------------------------------
[ 1538.539327] context A's detail
[ 1538.539530] ---------------------------------------------------
[ 1538.539918] context A
[ 1538.540072]     [S] down_write(&ei->i_data_sem:0)
[ 1538.540382]     [W] folio_wait_bit_common(pglocked:0)
[ 1538.540712]     [E] up_write(&ei->i_data_sem:0)
[ 1538.541015] 
[ 1538.541119] [S] down_write(&ei->i_data_sem:0):
[ 1538.541410] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.541782] stacktrace:
[ 1538.541946]       ext4_map_blocks+0x17b/0x680
[ 1538.542234]       ext4_getblk+0x5f/0x1f0
[ 1538.542493]       ext4_bread+0xc/0x70
[ 1538.542736]       ext4_append+0x48/0xf0
[ 1538.542991]       ext4_init_new_dir+0xc8/0x160
[ 1538.543284]       ext4_mkdir+0x19a/0x320
[ 1538.543542]       vfs_mkdir+0x83/0xe0
[ 1538.543788]       do_mkdirat+0x8c/0x130
[ 1538.544042]       __x64_sys_mkdir+0x29/0x30
[ 1538.544319]       do_syscall_64+0x40/0x90
[ 1538.544584]       entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1538.544949] 
[ 1538.545054] [W] folio_wait_bit_common(pglocked:0):
[ 1538.545370] [<ffffffff81259944>] __filemap_get_folio+0x3e4/0x420
[ 1538.545763] stacktrace:
[ 1538.545928]       folio_wait_bit_common+0x2fa/0x460
[ 1538.546248]       __filemap_get_folio+0x3e4/0x420
[ 1538.546558]       pagecache_get_page+0x11/0x40
[ 1538.546852]       ext4_mb_init_group+0x80/0x2e0
[ 1538.547152]       ext4_mb_good_group_nolock+0x2a3/0x2d0
[ 1538.547496]       ext4_mb_regular_allocator+0x391/0x780
[ 1538.547840]       ext4_mb_new_blocks+0x44e/0x720
[ 1538.548145]       ext4_ext_map_blocks+0x7f1/0xd00
[ 1538.548455]       ext4_map_blocks+0x19e/0x680
[ 1538.548743]       ext4_getblk+0x5f/0x1f0
[ 1538.549006]       ext4_bread+0xc/0x70
[ 1538.549250]       ext4_append+0x48/0xf0
[ 1538.549505]       ext4_init_new_dir+0xc8/0x160
[ 1538.549798]       ext4_mkdir+0x19a/0x320
[ 1538.550058]       vfs_mkdir+0x83/0xe0
[ 1538.550302]       do_mkdirat+0x8c/0x130
[ 1538.550557] 
[ 1538.550660] [E] up_write(&ei->i_data_sem:0):
[ 1538.550940] (N/A)
[ 1538.551071] ---------------------------------------------------
[ 1538.551459] context B's detail
[ 1538.551662] ---------------------------------------------------
[ 1538.552047] context B
[ 1538.552202]     [S] (unknown)(pglocked:0)
[ 1538.552466]     [W] down_write(&ei->i_data_sem:0)
[ 1538.552775]     [E] folio_wake_bit(pglocked:0)
[ 1538.553071] 
[ 1538.553174] [S] (unknown)(pglocked:0):
[ 1538.553422] (N/A)
[ 1538.553553] 
[ 1538.553656] [W] down_write(&ei->i_data_sem:0):
[ 1538.553948] [<ffffffff8141c01b>] ext4_map_blocks+0x17b/0x680
[ 1538.554320] stacktrace:
[ 1538.554485]       ext4_map_blocks+0x17b/0x680
[ 1538.554772]       mpage_map_and_submit_extent+0xef/0x530
[ 1538.555122]       ext4_writepages+0x798/0x990
[ 1538.555409]       do_writepages+0xcf/0x1c0
[ 1538.555682]       __writeback_single_inode+0x58/0x3f0
[ 1538.556014]       writeback_sb_inodes+0x210/0x540
[ 1538.556324]       __writeback_inodes_wb+0x4c/0xe0
[ 1538.556635]       wb_writeback+0x298/0x450
[ 1538.556911]       wb_do_writeback+0x29e/0x320
[ 1538.557199]       wb_workfn+0x6a/0x2c0
[ 1538.557447]       process_one_work+0x302/0x650
[ 1538.557743]       worker_thread+0x55/0x400
[ 1538.558013]       kthread+0xf0/0x120
[ 1538.558251]       ret_from_fork+0x1f/0x30
[ 1538.558518] 
[ 1538.558621] [E] folio_wake_bit(pglocked:0):
[ 1538.558896] [<ffffffff814418c0>] ext4_bio_write_page+0x400/0x560
[ 1538.559290] stacktrace:
[ 1538.559455]       ext4_bio_write_page+0x400/0x560
[ 1538.559765]       mpage_submit_page+0x5c/0x80
[ 1538.560051]       mpage_map_and_submit_buffers+0x15a/0x250
[ 1538.560409]       mpage_map_and_submit_extent+0x134/0x530
[ 1538.560764]       ext4_writepages+0x798/0x990
[ 1538.561057]       do_writepages+0xcf/0x1c0
[ 1538.561329]       __writeback_single_inode+0x58/0x3f0
[ 1538.561662]       writeback_sb_inodes+0x210/0x540
[ 1538.561973]       __writeback_inodes_wb+0x4c/0xe0
[ 1538.562283]       wb_writeback+0x298/0x450
[ 1538.562555]       wb_do_writeback+0x29e/0x320
[ 1538.562842]       wb_workfn+0x6a/0x2c0
[ 1538.563095]       process_one_work+0x302/0x650
[ 1538.563387]       worker_thread+0x55/0x400
[ 1538.563658]       kthread+0xf0/0x120
[ 1538.563895]       ret_from_fork+0x1f/0x30
[ 1538.564161] ---------------------------------------------------
[ 1538.564548] information that might be helpful
[ 1538.564832] ---------------------------------------------------
[ 1538.565223] CPU: 1 PID: 46539 Comm: dirstress Not tainted 5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571
[ 1538.565854] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 1538.566394] Call Trace:
[ 1538.566559]  <TASK>
[ 1538.566701]  dump_stack_lvl+0x4f/0x68
[ 1538.566945]  print_circle.cold+0x15b/0x169
[ 1538.567218]  ? print_circle+0xe0/0xe0
[ 1538.567461]  cb_check_dl+0x55/0x60
[ 1538.567687]  bfs+0xd5/0x1b0
[ 1538.567874]  add_dep+0xd3/0x1a0
[ 1538.568083]  ? __filemap_get_folio+0x3e4/0x420
[ 1538.568374]  add_wait+0xe3/0x250
[ 1538.568590]  ? __filemap_get_folio+0x3e4/0x420
[ 1538.568886]  dept_wait_split_map+0xb1/0x130
[ 1538.569163]  folio_wait_bit_common+0x2fa/0x460
[ 1538.569456]  ? lock_is_held_type+0xfc/0x130
[ 1538.569733]  __filemap_get_folio+0x3e4/0x420
[ 1538.570013]  ? __lock_release+0x1b2/0x2c0
[ 1538.570278]  pagecache_get_page+0x11/0x40
[ 1538.570543]  ext4_mb_init_group+0x80/0x2e0
[ 1538.570813]  ? ext4_get_group_desc+0xb2/0x200
[ 1538.571102]  ext4_mb_good_group_nolock+0x2a3/0x2d0
[ 1538.571418]  ext4_mb_regular_allocator+0x391/0x780
[ 1538.571733]  ? rcu_read_lock_sched_held+0x3f/0x70
[ 1538.572044]  ? trace_kmem_cache_alloc+0x2c/0xd0
[ 1538.572343]  ? kmem_cache_alloc+0x1f7/0x3f0
[ 1538.572618]  ext4_mb_new_blocks+0x44e/0x720
[ 1538.572896]  ext4_ext_map_blocks+0x7f1/0xd00
[ 1538.573179]  ? find_held_lock+0x2b/0x80
[ 1538.573434]  ext4_map_blocks+0x19e/0x680
[ 1538.573693]  ext4_getblk+0x5f/0x1f0
[ 1538.573927]  ext4_bread+0xc/0x70
[ 1538.574141]  ext4_append+0x48/0xf0
[ 1538.574369]  ext4_init_new_dir+0xc8/0x160
[ 1538.574634]  ext4_mkdir+0x19a/0x320
[ 1538.574866]  vfs_mkdir+0x83/0xe0
[ 1538.575082]  do_mkdirat+0x8c/0x130
[ 1538.575308]  __x64_sys_mkdir+0x29/0x30
[ 1538.575557]  do_syscall_64+0x40/0x90
[ 1538.575795]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1538.576128] RIP: 0033:0x7f0960466b07
[ 1538.576367] Code: 1f 40 00 48 8b 05 89 f3 0c 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 53 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 f3 0c 00 f7 d8 64 89 01 48
[ 1538.577576] RSP: 002b:00007ffd0fa955a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 1538.578069] RAX: ffffffffffffffda RBX: 0000000000000239 RCX: 00007f0960466b07
[ 1538.578533] RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffd0fa955d0
[ 1538.578995] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000010
[ 1538.579458] R10: 00007ffd0fa95345 R11: 0000000000000246 R12: 00000000000003e8
[ 1538.579923] R13: 0000000000000000 R14: 00007ffd0fa955d0 R15: 00007ffd0fa95dd0
[ 1538.580389]  </TASK>
[ 1540.581382] EXT4-fs (vdb): mounted filesystem with ordered data mode. Quota mode: none.
 [20:11:24] 8s


P.S.  Later on the console, the test ground to a halt because DEPT
started WARNING over and over and over again....

[ 3129.686102] DEPT_WARN_ON: dt->ecxt_held_pos == DEPT_MAX_ECXT_HELD
[ 3129.686396]  ? __might_fault+0x32/0x80
[ 3129.686660] WARNING: CPU: 1 PID: 107320 at kernel/dependency/dept.c:1537 add_ecxt+0x1c0/0x1d0
[ 3129.687040]  ? __might_fault+0x32/0x80
[ 3129.687282] CPU: 1 PID: 107320 Comm: aio-stress Tainted: G        W         5.18.0-rc5-xfstests-dept-00021-g8d3d751c9964 #571

with multiple CPUs completely spamming the serial console.  This
should probably be a WARN_ON_ONCE, or something that disables DEPT
entirely, since apparently no useful DEPT reports (or any useful
kernel work, for that matter) are going to happen after this.
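
A rough sketch of that suggestion, where the overflow condition comes
from the splat above and dept_disable() is a purely hypothetical kill
switch, might be:

   static bool add_ecxt(struct dept_task *dt)
   {
           if (WARN_ONCE(dt->ecxt_held_pos == DEPT_MAX_ECXT_HELD,
                         "DEPT: ecxt_held_pos overflowed, disabling DEPT\n")) {
                   dept_disable();   /* hypothetical: stop all checking */
                   return false;
           }
           /* ... record the event context as before ... */
           return true;
   }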
Theodore Ts'o May 9, 2022, 10:28 p.m. UTC | #8
Oh, one other problem with DEPT --- it's SLOW --- the overhead is
enormous.  Using kvm-xfstests[1] running "kvm-xfstests smoke", here
are some sample times:

			LOCKDEP		DEPT
Time to first test	49 seconds	602 seconds
ext4/001      		2 s		22 s
ext4/003		2 s		8 s
ext4/005		0 s		7 s
ext4/020		1 s		8 s
ext4/021		11 s		17 s
ext4/023		0 s		83 s
generic/001		4 s		76 s
generic/002		0 s		11 s
generic/003		10 s		19 s

There are some large variations; in some cases, some xfstests take 10x
as much time or more to run.  In fact, when I first started the
kvm-xfstests run with DEPT, I thought something had hung and that
tests would never start.  (In fact, with gce-xfstests the default
watchdog "something has gone terribly wrong with the kexec" had fired,
and I didn't get any test results using gce-xfstests at all.  If DEPT
goes in without any optimizations, I'm going to have to adjust the
watchdogs timers for gce-xfstests.)

The bottom line is that at the moment, between the false positives,
and the significant overhead imposed by DEPT, I would suggest that if
DEPT ever does go in, it should be possible to disable DEPT and
only use the existing CONFIG_PROVE_LOCKING version of LOCKDEP, just
because DEPT is S - L - O - W.

[1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

						- Ted

P.S.  Darrick and I both have disabled using LOCKDEP by default
because it slows down ext4 -g auto testing by a factor of 2, and xfs -g
auto testing by a factor of 3.  So the fact that DEPT is a factor of
2x to 10x or more slower than LOCKDEP when running various xfstests
tests should be a real concern.
Byungchul Park May 9, 2022, 11:38 p.m. UTC | #9
On Mon, May 09, 2022 at 04:47:12PM -0400, Steven Rostedt wrote:
> On Mon, 9 May 2022 09:16:37 +0900
> Byungchul Park <byungchul.park@lge.com> wrote:
> 
> > CASE 2.
> > 
> >    lock L with depth n
> >    lock A
> >    lock_nested L' with depth n + 1
> >    ...
> >    unlock L'
> >    unlock A
> >    unlock L
> > 
> > This case is allowed by Lockdep.
> > This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
> > 
> > ---
> > 
> > The following scenario would explain why CASE 2 is problematic.
> > 
> >    THREAD X			THREAD Y
> > 
> >    lock L with depth n
> > 				lock L' with depth n
> >    lock A
> > 				lock A
> >    lock_nested L' with depth n + 1
> 
> I'm confused by what exactly you are saying is a deadlock above.
> 
> Are you saying that lock A and L' are inverted? If so, lockdep had better

Hi Steven,

Yes, I was talking about A and L'.

> detect that regardless of L. A nested lock associates the nesting with

When I checked Lockdep code, L' with depth n + 1 and L' with depth n
have different classes in Lockdep.

That's why I said Lockdep cannot detect it. By any chance, has it
changed so as to consider this case? Or am I missing something?

> the same type of lock. That is, in lockdep, nested tells lockdep not to
> trigger on the L and L' but it will not ignore that A was taken.

It will not ignore A but it would work like this:

   THREAD X			THREAD Y

   lock Ln
				lock Ln
   lock A
				lock A
   lock_nested Lm
				lock_nested Lm

So, Lockdep considers this case safe, though actually it's not.

	Byungchul

> 
> -- Steve
> 
> 
> 
> > 				lock_nested L'' with depth n + 1
> >    ...				...
> >    unlock L'			unlock L''
> >    unlock A			unlock A
> >    unlock L			unlock L'
Byungchul Park May 10, 2022, 12:32 a.m. UTC | #10
On Mon, May 09, 2022 at 06:28:17PM -0400, Theodore Ts'o wrote:
> Oh, one other problem with DEPT --- it's SLOW --- the overhead is
> enormous.  Using kvm-xfstests[1] running "kvm-xfstests smoke", here
> are some sample times:

Yes, right. DEPT has never been optimized. Rather, it turns on
CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT is
enabled, because of a porting issue. I have no choice but to rely on
those to develop DEPT out of tree. Of course, that's what I don't like.

Plus, for now, I'm focusing on removing false positives. Once that's
considered settled, I will work on performance optimization. But it
will still have to rely on the Lockdep CONFIGs, adding additional
overhead on top, until DEPT can be developed in the tree.

> 			LOCKDEP		DEPT
> Time to first test	49 seconds	602 seconds
> ext4/001      		2 s		22 s
> ext4/003		2 s		8 s
> ext4/005		0 s		7 s
> ext4/020		1 s		8 s
> ext4/021		11 s		17 s
> ext4/023		0 s		83 s
> generic/001		4 s		76 s
> generic/002		0 s		11 s
> generic/003		10 s		19 s
> 
> There are some large variations; in some cases, some xfstests take 10x
> as much time or more to run.  In fact, when I first started the
> kvm-xfstests run with DEPT, I thought something had hung and that
> tests would never start.  (In fact, with gce-xfstests the default
> watchdog "something has gone terribly wrong with the kexec" had fired,
> and I didn't get any test results using gce-xfstests at all.  If DEPT
> goes in without any optimizations, I'm going to have to adjust the
> watchdogs timers for gce-xfstests.)

Thank you for letting me know. I will work on the optimization as well.

> The bottom line is that at the moment, between the false positives,
> and the significant overhead imposed by DEPT, I would suggest that if
> DEPT ever does go in, that it should be possible to disable DEPT and
> only use the existing CONFIG_PROVE_LOCKING version of LOCKDEP, just
> because DEPT is S - L - O - W.
> 
> [1] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
> 
> 						- Ted
> 
> P.S.  Darrick and I both have disabled using LOCKDEP by default
> because it slows down ext4 -g auto testing by a factor of 2, and xfs -g
> auto testing by a factor of 3.  So the fact that DEPT is a factor of
> 2x to 10x or more slower than LOCKDEP when running various xfstests
> tests should be a real concern.

DEPT is tracking way more objects than Lockdep, so it's inevitably
slower, but let me try to make its performance similar to Lockdep's.

	Byungchul
Theodore Ts'o May 10, 2022, 1:32 a.m. UTC | #11
On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> Yes, right. DEPT has never been optimized. Rather, it turns on
> CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT is
> enabled, because of a porting issue. I have no choice but to rely on
> those to develop DEPT out of tree. Of course, that's what I don't like.

Sure, but blaming the overhead on unnecessary CONFIG_PROVE_LOCKING
overhead can explain only a tiny fraction of the slowdown.  Consider:
if time to first test (time to boot the kernel, setup the test
environment, figure out which tests to run, etc.) is 12 seconds w/o
LOCKDEP, 49 seconds with LOCKDEP/PROVE_LOCKING and 602 seconds with
DEPT, you can really only blame 37 seconds out of the 602 seconds of
DEPT on unnecessary PROVE_LOCKING overhead.

So let's assume we can get rid of all of the PROVE_LOCKING overhead.
We're still talking about 12 seconds for time-to-first test without
any lock debugging, versus ** 565 ** seconds for time-to-first test
with DEPT.  That's a factor of 47x for DEPT sans LOCKDEP overhead,
compared to a 4x overhead for PROVE_LOCKING.

> Plus, for now, I'm focusing on removing false positives. Once that's
> considered settled, I will work on performance optimization. But it
> will still have to rely on the Lockdep CONFIGs, adding additional
> overhead on top, until DEPT can be developed in the tree.

Well, please take a look at the false positive which I reported.  I
suspect that in order to fix that particular false positive, we'll
either need to have a way to disable DEPT on waiting on all page/folio
dirty bits, or it will need to treat pages from different inodes
and/or address spaces as being entirely separate classes, instead of
collapsing all inode dirty bits, and all of the various inodes' mutexes
(such as ext4's i_data_sem) as being part of a single object class.

> DEPT is tracking way more objects than Lockdep, so it's inevitably
> slower, but let me try to make its performance similar to Lockdep's.

In order to eliminate some of these false positives, I suspect it's
going to increase the number of object classes that DEPT will need to
track even *more*.  At which point, the cost/benefit of DEPT may get
called into question, especially if all of the false positives can't
be suppressed.

					- Ted
Byungchul Park May 10, 2022, 5:37 a.m. UTC | #12
Ted wrote:
> On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> > Yes, right. DEPT has never been optimized. Rather, it turns on
> > CONFIG_LOCKDEP and even CONFIG_PROVE_LOCKING when CONFIG_DEPT is
> > enabled, because of a porting issue. I have no choice but to rely on
> > those to develop DEPT out of tree. Of course, that's what I don't like.
> 
> Sure, but blaming the overhead on unnecessary CONFIG_PROVE_LOCKING
> overhead can explain only a tiny fraction of the slowdown.  Consider:
> if time to first test (time to boot the kernel, setup the test
> environment, figure out which tests to run, etc.) is 12 seconds w/o
> LOCKDEP, 49 seconds with LOCKDEP/PROVE_LOCKING and 602 seconds with
> DEPT, you can really only blame 37 seconds out of the 602 seconds of
> DEPT on unnecessary PROVE_LOCKING overhead.
> 
> So let's assume we can get rid of all of the PROVE_LOCKING overhead.
> We're still talking about 12 seconds for time-to-first test without
> any lock debugging, versus ** 565 ** seconds for time-to-first test
> with DEPT.  That's a factor of 47x for DEPT sans LOCKDEP overhead,
> compared to a 4x overhead for PROVE_LOCKING.

Okay. I will work on it.

> > Plus, for now, I'm focusing on removing false positives. Once that's
> > considered settled, I will work on performance optimization. But it
> > will still have to rely on the Lockdep CONFIGs, adding additional
> > overhead on top, until DEPT can be developed in the tree.
> 
> Well, please take a look at the false positive which I reported.  I
> suspect that in order to fix that particular false positive, we'll
> either need to have a way to disable DEPT on waiting on all page/folio
> dirty bits, or it will need to treat pages from different inodes
> and/or address spaces as being entirely separate classes, instead of
> collapsing all inode dirty bits, and all of the various inodes' mutexes
> (such as ext4's i_data_sem) as being part of a single object class.

I'd rather solve it by assigning different classes to different types of
inode. This is the right way.

> > > DEPT is tracking way more objects than Lockdep, so it's inevitably
> > > slower, but let me try to make its performance similar to Lockdep's.
> 
> In order to eliminate some of these false positives, I suspect it's
> going to increase the number of object classes that DEPT will need to
> track even *more*.  At which point, the cost/benefit of DEPT may get
> called into question, especially if all of the false positives can't
> be suppressed.

Look. Let's talk in general terms. There's no way to get rid of the
false positives entirely. It's a matter of *balancing* between
considering potential cases and only real ones. Definitely, potential
is not real. The more potential things we consider, the higher the
chance that false positives appear.

But yes. The advantage we'd take by detecting potential ones should be
higher than the risk of being bothered by false ones. Do you think a
tool is useless if it produces a few false positives? Of course, it'd
be a problem if there are too many, but otherwise, I think it'd be a
great tool if the advantage > the risk.

Don't get me wrong here. It doesn't mean DEPT is perfect for now. The
performance should be improved and false alarms that appear should be
removed, of course. I'm talking about the direction.

For now, there's no tool to track wait/event itself in the Linux
kernel - a subset of the functionality exists tho. DEPT is the first
try for that purpose and can be a useful tool given the right
direction.

I know what you are concerned about. I bet it's the false positives
that are going to bother you once merged. I'll insist that DEPT
shouldn't be used as a mandatory testing tool until it's considered
stable enough. But what about the ones who would take the advantage
and use DEPT? Why don't you think of folks who will take the advantage
from the hints about synchronization dependencies, esp. when their
subsystem requires very complicated synchronization? Should a tool be
useful only in a final testing stage? What about its usefulness during
the development stage?

It's worth noting that DEPT works with any wait/event, so any lockups,
e.g. even by a HW-SW interface, retry logic or the like, can be
detected by DEPT once all waits and events are tagged properly. I
believe the advantage of that is much higher than the bad side of
facing false alarms. It's just my opinion. I'm going to respect the
majority opinion.

	Byungchul
> 
> 					- Ted
>
Hyeonggon Yoo May 10, 2022, 11:18 a.m. UTC | #13
On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> > On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> > > Linus wrote:
> > > >
> > > > On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> > > > >
> > > > > Hi Linus and folks,
> > > > >
> > > > > I've been developing a tool for detecting deadlock possibilities by
> > > > > tracking wait/event rather than lock acquisition order, to try to
> > > > > cover all synchronization mechanisms.
> > > > 
> > > > So what is the actual status of reports these days?
> > > > 
> > > > Last time I looked at some reports, it gave a lot of false positives
> > > > due to mis-understanding prepare_to_sleep().
> > > 
> > > Yes, it did. I handled the case in the following way:
> > > 
> > > 1. Stage the wait at prepare_to_sleep(), which might be committed
> > >    later. The staged wait is not yet an actual wait that Dept
> > >    considers.
> > > 2. If the condition for sleep is true, the wait gets committed at
> > >    __schedule() and becomes an actual wait that Dept considers.
> > > 3. If the condition is false and the task gets back to TASK_RUNNING,
> > >    clean (=reset) the staged wait.
> > > 
> > > That way, Dept only considers the waits through sleep that actually
> > > reach __schedule().
> > > 
> > > > For this all to make sense, it would need to not have false positives
> > > > (or at least a very small number of them together with a way to sanely
> > > 
> > > Yes. I agree with you. I got rid of them in the way I described above.
> > >
> > 
> > IMHO DEPT should not report what lockdep allows (Not talking about
> 
> No.
> 
> > wait events). I mean lockdep allows some kind of nested locks but
> > DEPT reports them.
> 
> > You have already asked exactly the same question in another LKML
> > thread. I answered it that time, but let me explain it again.
> 
> ---
> 
> CASE 1.
> 
>    lock L with depth n
>    lock_nested L' with depth n + 1
>    ...
>    unlock L'
>    unlock L
> 
> This case is allowed by Lockdep.
> This case is allowed by DEPT cuz it's not a deadlock.
> 
> CASE 2.
> 
>    lock L with depth n
>    lock A
>    lock_nested L' with depth n + 1
>    ...
>    unlock L'
>    unlock A
>    unlock L
> 
> This case is allowed by Lockdep.
> This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
>

Yeah, in previous threads we discussed this [1]

And the case was:
	scan_mutex -> object_lock -> kmemleak_lock -> object_lock
And dept reported:
	object_lock -> kmemleak_lock, kmemleak_lock -> object_lock as
	deadlock.

But IIUC - what DEPT reported happens only under scan_mutex, and it is
not simple to just not take them, because the object can be removed
from the list and freed while scanning via kmemleak_free() without
kmemleak_lock and object_lock.

I'm just still not sure that someone will fix the warning in the
future - even if the locking rule is not good - if it will not cause a
real deadlock.

> ---
> 
> The following scenario would explain why CASE 2 is problematic.
> 
>    THREAD X			THREAD Y
> 
>    lock L with depth n
> 				lock L' with depth n
>    lock A
> 				lock A
>    lock_nested L' with depth n + 1
> 				lock_nested L'' with depth n + 1
>    ...				...
>    unlock L'			unlock L''
>    unlock A			unlock A
>    unlock L			unlock L'
> 
> Yes. I need to check whether the report you shared with me is a true
> one, but either way, it's not because DEPT doesn't work with
> *_nested() APIs.
>

Sorry, it was not right just to say DEPT doesn't work with *_nested() APIs.

> 	Byungchul

[1] https://lore.kernel.org/lkml/20220304002809.GA6112@X58A-UD3R/
Steven Rostedt May 10, 2022, 2:12 p.m. UTC | #14
On Tue, 10 May 2022 08:38:38 +0900
Byungchul Park <byungchul.park@lge.com> wrote:

> Yes, I was talking about A and L'.
> 
> > detect that regardless of L. A nested lock associates the nesting with
> 
> When I checked Lockdep code, L' with depth n + 1 and L' with depth n
> have different classes in Lockdep.

If that's the case, then that's a bug in lockdep.

> 
> That's why I said Lockdep cannot detect it. By any chance, has it
> changed so as to consider this case? Or am I missing something?

No, it's not that lockdep cannot detect it, it should detect it. If it is
not detecting it, then we need to fix that.

-- Steve
Byungchul Park May 10, 2022, 11:26 p.m. UTC | #15
On Tue, May 10, 2022 at 10:12:54AM -0400, Steven Rostedt wrote:
> On Tue, 10 May 2022 08:38:38 +0900
> Byungchul Park <byungchul.park@lge.com> wrote:
> 
> > Yes, I was talking about A and L'.
> > 
> > > detect that regardless of L. A nested lock associates the nesting with
> > 
> > When I checked Lockdep code, L' with depth n + 1 and L' with depth n
> > have different classes in Lockdep.
> 
> If that's the case, then that's a bug in lockdep.

Yes, agreed. I should've said 'Lockdep doesn't detect it currently'
rather than 'Lockdep can't detect it'.

I also think we can make it work for this case by fixing the bug in
Lockdep.

> > 
> > That's why I said Lockdep cannot detect it. By any chance, has it
> > changed so as to consider this case? Or am I missing something?
> 
> No, it's not that lockdep cannot detect it, it should detect it. If it is
> not detecting it, then we need to fix that.

Yes.

	Byungchul
> 
> -- Steve
Byungchul Park May 10, 2022, 11:39 p.m. UTC | #16
On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> > > On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> > > > Linus wrote:
> > > > >
> > > > > On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> > > > > >
> > > > > > Hi Linus and folks,
> > > > > >
> > > > > > I've been developing a tool for detecting deadlock possibilities by
> > > > > > tracking wait/event rather than lock acquisition order, to try to
> > > > > > cover all synchronization mechanisms.
> > > > > 
> > > > > So what is the actual status of reports these days?
> > > > > 
> > > > > Last time I looked at some reports, it gave a lot of false positives
> > > > > due to mis-understanding prepare_to_sleep().
> > > > 
> > > > Yes, it did. I handled the case in the following way:
> > > > 
> > > > 1. Stage the wait at prepare_to_sleep(), which might be committed
> > > >    later. The staged wait is not yet an actual wait that Dept
> > > >    considers.
> > > > 2. If the condition for sleep is true, the wait gets committed at
> > > >    __schedule() and becomes an actual wait that Dept considers.
> > > > 3. If the condition is false and the task gets back to TASK_RUNNING,
> > > >    clean (=reset) the staged wait.
> > > > 
> > > > That way, Dept only considers the waits through sleep that actually
> > > > reach __schedule().
> > > > 
> > > > > For this all to make sense, it would need to not have false positives
> > > > > (or at least a very small number of them together with a way to sanely
> > > > 
> > > > Yes. I agree with you. I got rid of them in the way I described above.
> > > >
> > > 
> > > IMHO DEPT should not report what lockdep allows (Not talking about
> > 
> > No.
> > 
> > > wait events). I mean lockdep allows some kind of nested locks but
> > > DEPT reports them.
> > 
> > > You have already asked exactly the same question in another LKML
> > > thread. I answered it that time, but let me explain it again.
> > 
> > ---
> > 
> > CASE 1.
> > 
> >    lock L with depth n
> >    lock_nested L' with depth n + 1
> >    ...
> >    unlock L'
> >    unlock L
> > 
> > This case is allowed by Lockdep.
> > This case is allowed by DEPT cuz it's not a deadlock.
> > 
> > CASE 2.
> > 
> >    lock L with depth n
> >    lock A
> >    lock_nested L' with depth n + 1
> >    ...
> >    unlock L'
> >    unlock A
> >    unlock L
> > 
> > This case is allowed by Lockdep.
> > This case is *NOT* allowed by DEPT cuz it's a *DEADLOCK*.
> >
> 
> Yeah, in previous threads we discussed this [1]
> 
> And the case was:
> 	scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> And dept reported:
> 	object_lock -> kmemleak_lock, kmemleak_lock -> object_lock as
> 	deadlock.
> 
> But IIUC - what DEPT reported happens only under scan_mutex, and it is
> not simple to just not take them, because the object can be removed
> from the list and freed while scanning via kmemleak_free() without
> kmemleak_lock and object_lock.

That should be one of the following orders:

1. kmemleak_lock -> object_lock -> object_lock(nested)
2. object_lock -> object_lock(nested) -> kmemleak_lock

> I'm just still not sure that someone will fix the warning in the
> future - even if the locking rule is not good - if it will not cause a
> real deadlock.

There are more important things than making the code just work for
now. For example, maintenance, communication via code between current
developers and potential newcomers in the future, and so on.

At least, a comment describing why the wrong order in the code is safe
should be added. I wouldn't allow the current order in the code if I
were the maintainer.

	Byungchul

> > ---
> > 
> > The following scenario would explain why CASE 2 is problematic.
> > 
> >    THREAD X			THREAD Y
> > 
> >    lock L with depth n
> > 				lock L' with depth n
> >    lock A
> > 				lock A
> >    lock_nested L' with depth n + 1
> > 				lock_nested L'' with depth n + 1
> >    ...				...
> >    unlock L'			unlock L''
> >    unlock A			unlock A
> >    unlock L			unlock L'
> > 
> > Yes. I need to check whether the report you shared with me is a true
> > one, but either way, it's not because DEPT doesn't work with
> > *_nested() APIs.
> >
> 
> Sorry, it was not right just to say DEPT doesn't work with *_nested() APIs.
> 
> > 	Byungchul
> 
> [1] https://lore.kernel.org/lkml/20220304002809.GA6112@X58A-UD3R/
> 
> -- 
> Thanks,
> Hyeonggon
Byungchul Park May 11, 2022, 1:16 a.m. UTC | #17
On Tue, May 10, 2022 at 02:37:40PM +0900, Byungchul Park wrote:
> Ted wrote:
> > On Tue, May 10, 2022 at 09:32:13AM +0900, Byungchul Park wrote:
> > > DEPT is tracking way more objects than Lockdep, so it's inevitably
> > > slower, but let me try to make its performance similar to Lockdep's.
> > 
> > In order to eliminate some of these false positives, I suspect it's
> > going to increase the number of object classes that DEPT will need to
> > track even *more*.  At which point, the cost/benefit of DEPT may get
> > called into question, especially if all of the false positives can't
> > be suppressed.
> 
> Look. Let's talk in general terms. There's no way to get rid of the
> false positives entirely. It's a matter of *balancing* between
> considering potential cases and only real ones. Definitely, potential
> is not real. The more potential things we consider, the higher the
> chance that false positives appear.
> 
> But yes. The advantage we'd take by detecting potential ones should be
> higher than the risk of being bothered by false ones. Do you think a
> tool is useless if it produces a few false positives? Of course, it'd
> be a problem if there are too many, but otherwise, I think it'd be a
> great tool if the advantage > the risk.
> 
> Don't get me wrong here. It doesn't mean DEPT is perfect for now. The
> performance should be improved and false alarms that appear should be
> removed, of course. I'm talking about the direction.
> 
> For now, there's no tool to track wait/event itself in the Linux
> kernel - a subset of the functionality exists tho. DEPT is the first
> try for that purpose and can be a useful tool given the right
> direction.
> 
> I know what you are concerned about. I bet it's the false positives
> that are going to bother you once merged. I'll insist that DEPT
> shouldn't be used as a mandatory testing tool until it's considered
> stable enough. But what about the ones who would take the advantage
> and use DEPT? Why don't you think of folks who will take the advantage
> from the hints about synchronization dependencies, esp. when their
> subsystem requires very complicated synchronization? Should a tool be
> useful only in a final testing stage? What about its usefulness during
> the development stage?
> 
> It's worth noting that DEPT works with any wait/event, so any lockups,
> e.g. even by a HW-SW interface, retry logic or the like, can be
> detected by DEPT once all waits and events are tagged properly. I
> believe the advantage of that is much higher than the bad side of
> facing false alarms. It's just my opinion. I'm going to respect the
> majority opinion.

s/take advantage/have the benefit/g

	Byungchul
Hyeonggon Yoo May 11, 2022, 10:04 a.m. UTC | #18
On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> > On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > > On Sat, May 07, 2022 at 04:20:50PM +0900, Hyeonggon Yoo wrote:
> > > > On Fri, May 06, 2022 at 09:11:35AM +0900, Byungchul Park wrote:
> > > > > Linus wrote:
> > > > > >
> > > > > > On Wed, May 4, 2022 at 1:19 AM Byungchul Park <byungchul.park@lge.com> wrote:
> > > > > > >
> > > > > > > Hi Linus and folks,
> > > > > > >
> > > > > > > I've been developing a tool for detecting deadlock possibilities by
> > > > > > > tracking wait/event rather than lock(?) acquisition order to try to
> > > > > > > cover all synchronization mechanisms.
> > > > > > 
> > > > > > So what is the actual status of reports these days?
> > > > > > 
> > > > > > Last time I looked at some reports, it gave a lot of false positives
> > > > > > due to misunderstanding prepare_to_sleep().
> > > > > 
> > > > > Yes, it was. I handled the case in the following way:
> > > > > 
> > > > > 1. Stage the wait at prepare_to_sleep(), where it might be committed
> > > > >    later. At this point it is not yet an actual wait that Dept
> > > > >    considers.
> > > > > 2. If the condition for sleep is true, the wait is committed at
> > > > >    __schedule() and becomes an actual wait that Dept considers.
> > > > > 3. If the condition is false and the task gets back to TASK_RUNNING,
> > > > >    clean (= reset) the staged wait.
> > > > > 
> > > > > That way, Dept only considers the waits that actually reach
> > > > > __schedule() on the sleep path.
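> > > > > 
> > > > > A minimal sketch of that stage/commit/clean flow (the names and the
> > > > > struct here are illustrative only, not Dept's real internals):
> > > > > 
> > > > >    enum wait_stage { NOT_STAGED, STAGED };
> > > > > 
> > > > >    struct dept_task_sketch {
> > > > >            enum wait_stage stage;
> > > > >    };
> > > > > 
> > > > >    /* 1. prepare_to_sleep(): remember the wait, don't report it yet */
> > > > >    static void stage_wait(struct dept_task_sketch *dt)
> > > > >    {
> > > > >            dt->stage = STAGED;
> > > > >    }
> > > > > 
> > > > >    /* 2. __schedule(): the sleep really happens; commit the wait */
> > > > >    static void commit_wait(struct dept_task_sketch *dt)
> > > > >    {
> > > > >            if (dt->stage == STAGED) {
> > > > >                    /* record the wait and build dependencies here */
> > > > >                    dt->stage = NOT_STAGED;
> > > > >            }
> > > > >    }
> > > > > 
> > > > >    /* 3. back to TASK_RUNNING without sleeping: forget the wait */
> > > > >    static void clean_wait(struct dept_task_sketch *dt)
> > > > >    {
> > > > >            dt->stage = NOT_STAGED;
> > > > >    }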
> > > > > 
> > > > > > For this all to make sense, it would need to not have false positives
> > > > > > (or at least a very small number of them together with a way to sanely
> > > > > 
> > > > > Yes, I agree with you. I got rid of them in the way I described above.
> > > > >
> > > > 
> > > > IMHO DEPT should not report what lockdep allows (not talking about
> > > 
> > > No.
> > > 
> > > > wait events). I mean lockdep allows some kinds of nested locks but
> > > > DEPT reports them.
> > > 
> > > You have already asked exactly the same question in another LKML
> > > thread. I answered it then, but let me explain it again.
> > > 
> > > ---
> > > 
> > > CASE 1.
> > > 
> > >    lock L with depth n
> > >    lock_nested L' with depth n + 1
> > >    ...
> > >    unlock L'
> > >    unlock L
> > > 
> > > This case is allowed by Lockdep.
> > > This case is allowed by DEPT because it's not a deadlock.
> > > 
> > > CASE 2.
> > > 
> > >    lock L with depth n
> > >    lock A
> > >    lock_nested L' with depth n + 1
> > >    ...
> > >    unlock L'
> > >    unlock A
> > >    unlock L
> > > 
> > > This case is allowed by Lockdep.
> > > This case is *NOT* allowed by DEPT because it's a *DEADLOCK*.
> > >
> > 
> > Yeah, in previous threads we discussed this [1]
> > 
> > And the case was:
> > 	scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> > And dept reported:
> > 	object_lock -> kmemleak_lock, kmemleak_lock -> object_lock as
> > 	deadlock.
> > 
> > But IIUC - what DEPT reported happens only under scan_mutex, and it is
> > not simple to just not take those locks, because the object can be
> > removed from the list and freed while scanning via kmemleak_free() if
> > kmemleak_lock and object_lock are not held.
>
>
> That should be one of the following orders:
> 
> 1. kmemleak_lock -> object_lock -> object_lock(nested)
> 2. object_lock -> object_lock(nested) -> kmemleak_lock
> 
> > I'm just still not sure that anyone will fix the warning in the future
> > - even if the locking rule is not good - if it does not cause a real
> > deadlock.
> 
> There are more important things than making the code just work for now:
> for example, maintainability, and communication via code between the
> current developers and potential newcomers in the future.

Then we will keep getting the same reports from DEPT until the already
existing bad code (even if it does not cause a deadlock) is reworked. If
you think that is the right thing to do, okay.

> At least, a comment describing why the wrong order in the code is safe
> should be added.

AFAIK the comment is already there in mm/kmemleak.c.

> I wouldn't allow the current order in the code if I
> were the maintainer.

[+Cc Catalin]
He may have an opinion.

Thanks,
Hyeonggon

> 	Byungchul
> 
> > > ---
> > > 
> > > The following scenario would explain why CASE 2 is problematic.
> > > 
> > >    THREAD X			THREAD Y
> > > 
> > >    lock L with depth n
> > > 				lock L' with depth n
> > >    lock A
> > > 				lock A
> > >    lock_nested L' with depth n + 1
> > > 				lock_nested L'' with depth n + 1
> > >    ...				...
> > >    unlock L'			unlock L''
> > >    unlock A			unlock A
> > >    unlock L			unlock L'
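> > > 
> > > A self-contained sketch of that scenario (hypothetical locks, not taken
> > > from real kernel code; the 'node' locks are all of class L and 'a'
> > > plays the role of lock A):
> > > 
> > >    struct node { struct mutex lock; };
> > >    static DEFINE_MUTEX(a);
> > > 
> > >    static void lock_pair(struct node *outer, struct node *inner)
> > >    {
> > >            mutex_lock(&outer->lock);       /* class L, depth n */
> > >            mutex_lock(&a);                 /* lock A */
> > >            /* the _nested() annotation silences Lockdep here */
> > >            mutex_lock_nested(&inner->lock, SINGLE_DEPTH_NESTING);
> > > 
> > >            /* ... critical section ... */
> > > 
> > >            mutex_unlock(&inner->lock);
> > >            mutex_unlock(&a);
> > >            mutex_unlock(&outer->lock);
> > >    }
> > > 
> > > If thread X calls lock_pair(l1, l2) while thread Y calls
> > > lock_pair(l2, l3), X can end up holding A and waiting for l2 while Y
> > > holds l2 and waits for A - the ABBA deadlock between A and L' shown
> > > above.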
> > > 
> > > Yes. I need to check if the report you shared with me is a true one,
> > > but it's not a result of DEPT failing to handle the *_nested() APIs.
> > >
> > 
> > Sorry, it was not right of me to simply say that DEPT doesn't work with the _nested() APIs.
> > 
> > > 	Byungchul
> > 
> > [1] https://lore.kernel.org/lkml/20220304002809.GA6112@X58A-UD3R/
> > 
> > -- 
> > Thanks,
> > Hyeonggon
Byungchul Park May 12, 2022, 5:25 a.m. UTC | #19
+cc mcgrof@kernel.org (firmware)
+cc holt@sgi.com (syscall reboot)

Hi Luis, Robin and folks,

I'm developing a tool for lockup detection, DEPT (Dependency Tracker).
I got a DEPT report from Hyeonggon - Thanks, Hyeonggon!

It doesn't mean the code *definitely* has a deadlock. However, it looks
problematic to me, so I'd like to ask a few things to see if it actually
is.

Because Hyeonggon didn't run decode_stacktrace.sh before sending it to
me, I don't have a report with better debugging information, but I can
explain it in this mail. The problematic scenario looks like this:


PROCESS A	PROCESS B	WORKER C

__do_sys_reboot()
		__do_sys_reboot()
 mutex_lock(&system_transition_mutex)
 ...		 mutex_lock(&system_transition_mutex) <- stuck
		 ...
				request_firmware_work_func()
				 _request_firmware()
				  firmware_fallback_sysfs()
				   usermodehelper_read_lock_wait()
				    down_read(&umhelper_sem)
				   ...
				   fw_load_sysfs_fallback()
				    fw_sysfs_wait_timeout()
				     wait_for_completion_killable_timeout(&fw_st->completion) <- stuck
 kernel_halt()
  __usermodehelper_disable()
   down_write(&umhelper_sem) <- stuck

--------------------------------------------------------
All three contexts are stuck at this point.
--------------------------------------------------------

PROCESS A	PROCESS B	WORKER C

   ...
   up_write(&umhelper_sem)
 ...
 mutex_unlock(&system_transition_mutex) <- cannot wake up B

		 ...
		 kernel_halt()
		  notifier_call_chain()
		    fw_shutdown_notify()
		    kill_pending_fw_fallback_reqs()
		     __fw_load_abort()
		      complete_all(&fw_st->completion) <- cannot wake up C

				   ...
				   usermodehelper_read_unlock()
				    up_read(&umhelper_sem) <- cannot wake up A


So I think this scenario is problematic. Or am I missing something here?
Or do you think it's okay because the wait_for_completion_*() has a
timeout? AFAIK, timeouts are not supposed to fire in normal cases.

It'd be appreciated if you could share your opinion on the report.

	Byungchul

---

[   18.136012][    T1] ===================================================
[   18.136419][    T1] DEPT: Circular dependency has been detected.
[   18.136782][    T1] 5.18.0-rc3-57979-gc2b89afca919 #2374 Tainted: G    B            
[   18.137249][    T1] ---------------------------------------------------
[   18.137649][    T1] summary
[   18.137823][    T1] ---------------------------------------------------
[   18.138222][    T1] *** DEADLOCK ***
[   18.138222][    T1] 
[   18.138569][    T1] context A
[   18.138754][    T1]     [S] __mutex_lock_common(system_transition_mutex:0)
[   18.139170][    T1]     [W] down_write(umhelper_sem:0)
[   18.139482][    T1]     [E] mutex_unlock(system_transition_mutex:0)
[   18.139865][    T1] 
[   18.140004][    T1] context B
[   18.140189][    T1]     [S] (unknown)(&fw_st->completion:0)
[   18.140527][    T1]     [W] __mutex_lock_common(system_transition_mutex:0)
[   18.140942][    T1]     [E] complete_all(&fw_st->completion:0)
[   18.141295][    T1] 
[   18.141434][    T1] context C
[   18.141618][    T1]     [S] down_read(umhelper_sem:0)
[   18.141926][    T1]     [W] wait_for_completion_killable_timeout(&fw_st->completion:0)
[   18.142402][    T1]     [E] up_read(umhelper_sem:0)
[   18.142699][    T1] 
[   18.142837][    T1] [S]: start of the event context
[   18.143134][    T1] [W]: the wait blocked
[   18.143379][    T1] [E]: the event not reachable
[   18.143661][    T1] ---------------------------------------------------
[   18.144063][    T1] context A's detail
[   18.144293][    T1] ---------------------------------------------------
[   18.144691][    T1] context A
[   18.144875][    T1]     [S] __mutex_lock_common(system_transition_mutex:0)
[   18.145290][    T1]     [W] down_write(umhelper_sem:0)
[   18.145602][    T1]     [E] mutex_unlock(system_transition_mutex:0)
[   18.145982][    T1] 
[   18.146120][    T1] [S] __mutex_lock_common(system_transition_mutex:0):
[   18.146519][    T1] [<ffffffff810ee14c>] __do_sys_reboot+0x11f/0x24f
[   18.146907][    T1] stacktrace:
[   18.147101][    T1]       __mutex_lock+0x1f3/0x3f3
[   18.147396][    T1]       __do_sys_reboot+0x11f/0x24f
[   18.147706][    T1]       do_syscall_64+0xd4/0xfb
[   18.148001][    T1]       entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.148379][    T1] 
[   18.148517][    T1] [W] down_write(umhelper_sem:0):
[   18.148815][    T1] [<ffffffff810d9c14>] __usermodehelper_disable+0x80/0x17f
[   18.149243][    T1] stacktrace:
[   18.149438][    T1]       __dept_wait+0x115/0x15b
[   18.149726][    T1]       dept_wait+0xcd/0xf3
[   18.149993][    T1]       down_write+0x4e/0x82
[   18.150266][    T1]       __usermodehelper_disable+0x80/0x17f
[   18.150615][    T1]       kernel_halt+0x33/0x5d
[   18.150893][    T1]       __do_sys_reboot+0x197/0x24f
[   18.151201][    T1]       do_syscall_64+0xd4/0xfb
[   18.151489][    T1]       entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.151866][    T1] 
[   18.152004][    T1] [E] mutex_unlock(system_transition_mutex:0):
[   18.152368][    T1] (N/A)
[   18.152532][    T1] ---------------------------------------------------
[   18.152931][    T1] context B's detail
[   18.153161][    T1] ---------------------------------------------------
[   18.153559][    T1] context B
[   18.153743][    T1]     [S] (unknown)(&fw_st->completion:0)
[   18.154082][    T1]     [W] __mutex_lock_common(system_transition_mutex:0)
[   18.154496][    T1]     [E] complete_all(&fw_st->completion:0)
[   18.154848][    T1] 
[   18.154987][    T1] [S] (unknown)(&fw_st->completion:0):
[   18.155310][    T1] (N/A)
[   18.155473][    T1] 
[   18.155612][    T1] [W] __mutex_lock_common(system_transition_mutex:0):
[   18.156014][    T1] [<ffffffff810ee14c>] __do_sys_reboot+0x11f/0x24f
[   18.156400][    T1] stacktrace:
[   18.156594][    T1]       __mutex_lock+0x1ce/0x3f3
[   18.156887][    T1]       __do_sys_reboot+0x11f/0x24f
[   18.157196][    T1]       do_syscall_64+0xd4/0xfb
[   18.157482][    T1]       entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.157856][    T1] 
[   18.157993][    T1] [E] complete_all(&fw_st->completion:0):
[   18.158330][    T1] [<ffffffff81c04230>] kill_pending_fw_fallback_reqs+0x66/0x95
[   18.158778][    T1] stacktrace:
[   18.158973][    T1]       complete_all+0x28/0x58
[   18.159256][    T1]       kill_pending_fw_fallback_reqs+0x66/0x95
[   18.159624][    T1]       fw_shutdown_notify+0x7/0xa
[   18.159929][    T1]       notifier_call_chain+0x4f/0x81
[   18.160246][    T1]       blocking_notifier_call_chain+0x4c/0x64
[   18.160611][    T1]       kernel_halt+0x13/0x5d
[   18.160889][    T1]       __do_sys_reboot+0x197/0x24f
[   18.161197][    T1]       do_syscall_64+0xd4/0xfb
[   18.161485][    T1]       entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.161861][    T1] ---------------------------------------------------
[   18.162260][    T1] context C's detail
[   18.162490][    T1] ---------------------------------------------------
[   18.162889][    T1] context C
[   18.163073][    T1]     [S] down_read(umhelper_sem:0)
[   18.163379][    T1]     [W] wait_for_completion_killable_timeout(&fw_st->completion:0)
[   18.163855][    T1]     [E] up_read(umhelper_sem:0)
[   18.164150][    T1] 
[   18.164288][    T1] [S] down_read(umhelper_sem:0):
[   18.164578][    T1] [<ffffffff810d8f99>] usermodehelper_read_lock_wait+0xad/0x139
[   18.165027][    T1] stacktrace:
[   18.165220][    T1]       down_read+0x74/0x85
[   18.165487][    T1]       usermodehelper_read_lock_wait+0xad/0x139
[   18.165860][    T1]       firmware_fallback_sysfs+0x118/0x521
[   18.166207][    T1]       _request_firmware+0x7ef/0x91b
[   18.166525][    T1]       request_firmware_work_func+0xb1/0x13b
[   18.166884][    T1]       process_one_work+0x4c3/0x771
[   18.167199][    T1]       worker_thread+0x37f/0x49e
[   18.167497][    T1]       kthread+0x122/0x131
[   18.167768][    T1]       ret_from_fork+0x1f/0x30
[   18.168055][    T1] 
[   18.168192][    T1] [W] wait_for_completion_killable_timeout(&fw_st->completion:0):
[   18.168650][    T1] [<ffffffff81c046b7>] firmware_fallback_sysfs+0x42a/0x521
[   18.169076][    T1] stacktrace:
[   18.169270][    T1]       wait_for_completion_killable_timeout+0x3c/0x58
[   18.169676][    T1]       firmware_fallback_sysfs+0x42a/0x521
[   18.170024][    T1]       _request_firmware+0x7ef/0x91b
[   18.170341][    T1]       request_firmware_work_func+0xb1/0x13b
[   18.170699][    T1]       process_one_work+0x4c3/0x771
[   18.171012][    T1]       worker_thread+0x37f/0x49e
[   18.171309][    T1]       kthread+0x122/0x131
[   18.171575][    T1]       ret_from_fork+0x1f/0x30
[   18.171863][    T1] 
[   18.172001][    T1] [E] up_read(umhelper_sem:0):
[   18.172281][    T1] (N/A)
[   18.172445][    T1] ---------------------------------------------------
[   18.172844][    T1] information that might be helpful
[   18.173151][    T1] ---------------------------------------------------
[   18.173549][    T1] CPU: 0 PID: 1 Comm: init Tainted: G    B             5.18.0-rc3-57979-gc2b89afca919 #2374
[   18.174144][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   18.174687][    T1] Call Trace:
[   18.174882][    T1]  <TASK>
[   18.175056][    T1]  print_circle+0x6a2/0x6f9
[   18.175326][    T1]  ? extend_queue+0xf4/0xf4
[   18.175594][    T1]  ? __list_add_valid+0xce/0xf6
[   18.175886][    T1]  ? __list_add+0x45/0x4e
[   18.176144][    T1]  ? print_circle+0x6f9/0x6f9
[   18.176422][    T1]  cb_check_dl+0xc0/0xc7
[   18.176675][    T1]  bfs+0x1c8/0x27b
[   18.176897][    T1]  ? disconnect_class+0x24c/0x24c
[   18.177195][    T1]  ? __list_add+0x45/0x4e
[   18.177454][    T1]  add_dep+0x140/0x217
[   18.177697][    T1]  ? add_ecxt+0x33b/0x33b
[   18.177955][    T1]  ? llist_del_first+0xc/0x46
[   18.178234][    T1]  add_wait+0x58a/0x723
[   18.178482][    T1]  ? __thaw_task+0x3e/0x3e
[   18.178745][    T1]  ? add_dep+0x217/0x217
[   18.178998][    T1]  ? llist_add_batch+0x33/0x4c
[   18.179281][    T1]  ? check_new_class+0x139/0x30d
[   18.179574][    T1]  __dept_wait+0x115/0x15b
[   18.179837][    T1]  ? __usermodehelper_disable+0x80/0x17f
[   18.180170][    T1]  ? add_wait+0x723/0x723
[   18.180426][    T1]  ? lock_release+0x338/0x338
[   18.180704][    T1]  ? __usermodehelper_disable+0x80/0x17f
[   18.181037][    T1]  dept_wait+0xcd/0xf3
[   18.181280][    T1]  down_write+0x4e/0x82
[   18.181527][    T1]  ? __usermodehelper_disable+0x80/0x17f
[   18.181861][    T1]  __usermodehelper_disable+0x80/0x17f
[   18.182184][    T1]  ? __usermodehelper_set_disable_depth+0x3a/0x3a
[   18.182565][    T1]  ? dept_ecxt_exit+0x1c9/0x1f7
[   18.182854][    T1]  ? blocking_notifier_call_chain+0x57/0x64
[   18.183205][    T1]  kernel_halt+0x33/0x5d
[   18.183458][    T1]  __do_sys_reboot+0x197/0x24f
[   18.183742][    T1]  ? kernel_power_off+0x6b/0x6b
[   18.184033][    T1]  ? dept_on+0x27/0x33
[   18.184275][    T1]  ? dept_exit+0x38/0x42
[   18.184526][    T1]  ? dept_on+0x27/0x33
[   18.184767][    T1]  ? dept_on+0x27/0x33
[   18.185008][    T1]  ? dept_exit+0x38/0x42
[   18.185260][    T1]  ? dept_enirq_transition+0x25a/0x295
[   18.185582][    T1]  ? syscall_enter_from_user_mode+0x47/0x71
[   18.185930][    T1]  ? dept_aware_softirqs_disable+0x1e/0x1e
[   18.186274][    T1]  ? syscall_enter_from_user_mode+0x47/0x71
[   18.186622][    T1]  do_syscall_64+0xd4/0xfb
[   18.186883][    T1]  ? asm_exc_page_fault+0x1e/0x30
[   18.187180][    T1]  ? dept_aware_softirqs_disable+0x1e/0x1e
[   18.187526][    T1]  ? dept_kernel_enter+0x15/0x1e
[   18.187821][    T1]  ? do_syscall_64+0x13/0xfb
[   18.188094][    T1]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   18.188443][    T1] RIP: 0033:0x4485f7
[   18.188672][    T1] Code: 00 75 05 48 83 c4 28 c3 e8 26 17 00 00 66 0f 1f 44 00 00 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 c7 c2 b8 ff ff ff f7 d8 64 89 02 b8
[   18.189822][    T1] RSP: 002b:00007ffc92206f28 EFLAGS: 00000202 ORIG_RAX: 00000000000000a9
[   18.190316][    T1] RAX: ffffffffffffffda RBX: 00007ffc92207118 RCX: 00000000004485f7
[   18.190784][    T1] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
[   18.191254][    T1] RBP: 00007ffc92206f30 R08: 00000000016376a0 R09: 0000000000000000
[   18.191722][    T1] R10: 00000000004c3820 R11: 0000000000000202 R12: 0000000000000001
[   18.192194][    T1] R13: 00007ffc92207108 R14: 00000000004bf8d0 R15: 0000000000000001
[   18.192667][    T1]  </TASK>
Tejun Heo May 12, 2022, 9:15 a.m. UTC | #20
Hello,

Just took a look out of curiosity.

On Thu, May 12, 2022 at 02:25:57PM +0900, Byungchul Park wrote:
> PROCESS A	PROCESS B	WORKER C
> 
> __do_sys_reboot()
> 		__do_sys_reboot()
>  mutex_lock(&system_transition_mutex)
>  ...		 mutex_lock(&system_transition_mutex) <- stuck
> 		 ...
> 				request_firmware_work_func()
> 				 _request_firmware()
> 				  firmware_fallback_sysfs()
> 				   usermodehelper_read_lock_wait()
> 				    down_read(&umhelper_sem)
> 				   ...
> 				   fw_load_sysfs_fallback()
> 				    fw_sysfs_wait_timeout()
> 				     wait_for_completion_killable_timeout(&fw_st->completion) <- stuck
>  kernel_halt()
>   __usermodehelper_disable()
>    down_write(&umhelper_sem) <- stuck
> 
> --------------------------------------------------------
> All three contexts are stuck at this point.
> --------------------------------------------------------
> 
> PROCESS A	PROCESS B	WORKER C
> 
>    ...
>    up_write(&umhelper_sem)
>  ...
>  mutex_unlock(&system_transition_mutex) <- cannot wake up B
> 
> 		 ...
> 		 kernel_halt()
> 		  notifier_call_chain()
> 		   fw_shutdown_notify()
> 		    kill_pending_fw_fallback_reqs()
> 		     __fw_load_abort()
> 		      complete_all(&fw_st->completion) <- cannot wake up C
> 
> 				   ...
> 				   usermodehelper_read_unlock()
> 				    up_read(&umhelper_sem) <- cannot wake up A

I'm not sure I'm reading it correctly, but it looks like the "process B"
column is superfluous given that it's waiting on the same lock to do the
same thing that A is already doing (besides, you can't really halt the
machine twice). What it's reporting seems to be an ABBA deadlock between
A waiting on umhelper_sem and C waiting on fw_st->completion. The report
seems spurious:

1. wait_for_completion_killable_timeout() doesn't need someone to wake it up
   to make forward progress because it will unstick itself after the timeout
   expires (see the sketch after point 2).

2. complete_all() from __fw_load_abort() isn't the only source of wakeup.
   The fw loader can be, and mainly should be, woken up by firmware loading
   actually completing instead of being aborted.
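
To make point 1 concrete, here is a minimal sketch of how such a waiter
makes forward progress on its own (illustrative only, not the actual fw
loader code):

   /* 0 if woken by complete()/complete_all(), an error if killed or
    * timed out -- the caller is never stuck forever */
   static int wait_for_fw_sketch(struct completion *done, long ms)
   {
           long ret = wait_for_completion_killable_timeout(done,
                                           msecs_to_jiffies(ms));

           if (!ret)
                   return -ETIMEDOUT;      /* unstuck itself */
           if (ret < 0)
                   return ret;             /* fatal signal pending */
           return 0;
   }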

I guess the reason why B shows up there is that, between just A and C,
the operation order is such that the complete_all() takes place before
__usermodehelper_disable(), so the whole thing kinda doesn't make sense
as you can't block a past operation with a future one. Inserting process
B introduces the reverse ordering.

Thanks.
Byungchul Park May 12, 2022, 11:18 a.m. UTC | #21
Tejun wrote:
> Hello,

Hello,

> I'm not sure I'm reading it correctly, but it looks like the "process B" column

I think you're interpreting the report correctly.

> is superfluous given that it's waiting on the same lock to do the same thing
> that A is already doing (besides, you can't really halt the machine twice).

Indeed! I've been in a daze. I thought kernel_halt() could be called
twice for two different purposes. Sorry for the noise.

> What it's reporting seems to be an ABBA deadlock between A waiting on
> umhelper_sem and C waiting on fw_st->completion. The report seems spurious:
>
> 1. wait_for_completion_killable_timeout() doesn't need someone to wake it up
>    to make forward progress because it will unstick itself after timeout
>    expires.

I have a question about this one. Yes, it would never be stuck, thanks
to the timeout. However, IIUC, timeouts are not supposed to expire in
normal cases. So I thought a timeout expiration indicates an abnormal
case that should be reported in terms of dependency, so as to prevent
further expirations. That's why I have been trying to track even the
timeout APIs.

Do you think DEPT shouldn't track the timeout APIs? If I'm wrong, I'll
stop tracking them.

> 2. complete_all() from __fw_load_abort() isn't the only source of wakeup.
>    The fw loader can be, and mainly should be, woken up by firmware loading
>    actually completing instead of being aborted.

This is the point I'd like to ask about. In normal cases, fw_load_done()
happens, of course, once the loading completes. However, I was wondering
whether the kernel guarantees that either fw_load_done() or
fw_load_abort() will be called by *another* context during kernel_halt().

> Thanks.

Thank you very much!

	Byungchul

> 
> -- 
> tejun
>
Theodore Ts'o May 12, 2022, 1:56 p.m. UTC | #22
On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> I have a question about this one. Yes, it would never be stuck, thanks
> to the timeout. However, IIUC, timeouts are not supposed to expire in
> normal cases. So I thought a timeout expiration indicates an abnormal
> case that should be reported in terms of dependency, so as to prevent
> further expirations. That's why I have been trying to track even the
> timeout APIs.

As I believe I've already pointed out to you previously in ext4 and
ocfs2, the jbd2 timeout every five seconds happens **all** the time
while the file system is mounted.  Commits more frequent than every five
seconds are the exception case, at least for desktop/laptop workloads.

We *don't* hit the timeout only when a userspace process calls fsync(2),
or when the journal was incorrectly sized by the system administrator so
that it's too small and the workload has so many file system mutations
that we have to close the transaction prematurely, ahead of the 5-second
timeout.

> Do you think DEPT shouldn't track the timeout APIs? If I'm wrong, I'll
> stop tracking them.

DEPT tracking timeouts will cause false positives in at least some
cases.  At the very least, there needs to be an easy way to suppress
these false positives on a per wait/mutex/spinlock basis.
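
One possible shape for such an opt-out, purely as a sketch (dept_skip()
and the dept_map field are invented names; no such interface exists in
DEPT today):

   struct completion done;

   init_completion(&done);
   /* hypothetical: exclude this wait/event from dependency tracking */
   dept_skip(&done.dept_map);
   wait_for_completion_timeout(&done, 5 * HZ);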

      	       	    	     	      	   	 - Ted
Tejun Heo May 12, 2022, 4:41 p.m. UTC | #23
Hello,

On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> > 1. wait_for_completion_killable_timeout() doesn't need someone to wake it up
> >    to make forward progress because it will unstick itself after timeout
> >    expires.
> 
> I have a question about this one. Yes, it would never be stuck, thanks
> to the timeout. However, IIUC, timeouts are not supposed to expire in
> normal cases. So I thought a timeout expiration indicates an abnormal
> case that should be reported in terms of dependency, so as to prevent
> further expirations. That's why I have been trying to track even the
> timeout APIs.
> 
> Do you think DEPT shouldn't track the timeout APIs? If I'm wrong, I'll
> stop tracking them.

Without actually surveying the use cases, I can't say for sure, but my
experience has been that we often get pretty creative with timeouts and
it's something people actively think about and monitor (and it's usually
not subtle). Given that, I'm skeptical about how much value it'd add for
a dependency checker to warn about timeouts. It might be a net negative
rather than a net positive.

> > 2. complete_all() from __fw_load_abort() isn't the only source of wakeup.
> >    The fw loader can be, and mainly should be, woken up by firmware loading
> >    actually completing instead of being aborted.
> 
> This is the point I'd like to ask about. In normal cases, fw_load_done()
> happens, of course, once the loading completes. However, I was wondering
> whether the kernel guarantees that either fw_load_done() or
> fw_load_abort() will be called by *another* context during kernel_halt().

We'll have to walk through the code to tell. On a cursory look though,
up until that point (just before shutting down the usermode helper), I
don't see anything which would actively block firmware loading.

Thanks.
Catalin Marinas May 19, 2022, 10:11 a.m. UTC | #24
On Wed, May 11, 2022 at 07:04:51PM +0900, Hyeonggon Yoo wrote:
> On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> > On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> > > On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > > > CASE 1.
> > > > 
> > > >    lock L with depth n
> > > >    lock_nested L' with depth n + 1
> > > >    ...
> > > >    unlock L'
> > > >    unlock L
> > > > 
> > > > This case is allowed by Lockdep.
> > > > This case is allowed by DEPT because it's not a deadlock.
> > > > 
> > > > CASE 2.
> > > > 
> > > >    lock L with depth n
> > > >    lock A
> > > >    lock_nested L' with depth n + 1
> > > >    ...
> > > >    unlock L'
> > > >    unlock A
> > > >    unlock L
> > > > 
> > > > This case is allowed by Lockdep.
> > > > This case is *NOT* allowed by DEPT because it's a *DEADLOCK*.
> > > 
> > > Yeah, in previous threads we discussed this [1]
> > > 
> > > And the case was:
> > > 	scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> > > And dept reported:
> > > 	object_lock -> kmemleak_lock, kmemleak_lock -> object_lock as
> > > 	deadlock.
> > > 
> > > But IIUC - What DEPT reported happens only under scan_mutex and it
> > > is not simple just not to take them because the object can be
> > > removed from the list and freed while scanning via kmemleak_free()
> > > without kmemleak_lock and object_lock.

The above kmemleak sequence shouldn't deadlock since those locks, even
if taken in a different order, are serialised by scan_mutex. For various
reasons, trying to reduce the latency, I ended up with some
fine-grained, per-object locking.

For object allocation (rbtree modification) and tree search, we use
kmemleak_lock. During scanning (which can take minutes under
scan_mutex), we want to prevent (a) long latencies and (b) freeing the
object being scanned. We release the locks regularly for (a) and hold
the object->lock for (b).
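
Roughly, the scanning side described above looks like this (heavily
simplified; only the locking shape of mm/kmemleak.c is sketched here):

   static void scan_object_sketch(struct kmemleak_object *object)
   {
           unsigned long flags;

           /* (b) the object can't be freed while object->lock is held */
           raw_spin_lock_irqsave(&object->lock, flags);
           /*
            * scan_block() takes kmemleak_lock for its pointer lookups;
            * the locks are dropped and retaken every MAX_SCAN_SIZE
            * bytes to bound the latency (a).
            */
           raw_spin_unlock_irqrestore(&object->lock, flags);
   }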

In another thread Byungchul mentioned:

|    context X			context Y
| 
|    lock mutex A		lock mutex A
|    lock B			lock C
|    lock C			lock B
|    unlock C			unlock B
|    unlock B			unlock C
|    unlock mutex A		unlock mutex A
| 
| In my opinion, lock B and lock C are unnecessary if they are always
| along with lock mutex A. Or we should keep correct lock order across all
| the code.

If these are the only two places, yes, locks B and C would be
unnecessary. But we have those locks acquired (not nested) on the
allocation path (kmemleak_lock) and freeing path (object->lock). We
don't want to block those paths while scan_mutex is held.

That said, we may be able to use a single kmemleak_lock for everything.
The object freeing path may be affected slightly during scanning but the
code does release it every MAX_SCAN_SIZE bytes. It may even get slightly
faster as we'd hammer a single lock (I'll do some benchmarks).

But from a correctness perspective, I think the DEPT tool should be
improved a bit to detect when such out of order locking is serialised by
an enclosing lock/mutex.
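
A hypothetical sketch of that improvement (invented structures and
names; nothing like this exists in DEPT today): record the locks
already held when each dependency edge is created, and suppress a
reported cycle if some lock class is held across every edge of the
cycle.

   struct dep_edge {
           unsigned long held_classes;  /* bitmap of enclosing classes */
   };

   static bool cycle_is_serialised(const struct dep_edge *edge, int n)
   {
           unsigned long common = ~0UL;
           int i;

           for (i = 0; i < n; i++)
                   common &= edge[i].held_classes;

           /* a class common to all edges serialises the whole cycle */
           return common != 0;
   }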
Byungchul Park May 23, 2022, 1:10 a.m. UTC | #25
On Thu, May 12, 2022 at 09:56:46AM -0400, Theodore Ts'o wrote:
> On Thu, May 12, 2022 at 08:18:24PM +0900, Byungchul Park wrote:
> > I have a question about this one. Yes, it would never be stuck, thanks
> > to the timeout. However, IIUC, timeouts are not supposed to expire in
> > normal cases. So I thought a timeout expiration indicates an abnormal
> > case that should be reported in terms of dependency, so as to prevent
> > further expirations. That's why I have been trying to track even the
> > timeout APIs.
> 
> As I believe I've already pointed out to you previously in ext4 and
> ocfs2, the jbd2 timeout every five seconds happens **all** the time
> while the file system is mounted.  Commits more frequent than every five
> seconds are the exception case, at least for desktop/laptop workloads.

Thanks, Ted. It's easy to stop tracking APIs with a timeout. I've just
been afraid that real cases will be skipped along with the ones we want
to suppress anyway.

However, I should stop it if it produces too many false alarms.

> We *don't* hit the timeout only when a userspace process calls fsync(2),
> or when the journal was incorrectly sized by the system administrator so
> that it's too small and the workload has so many file system mutations
> that we have to close the transaction prematurely, ahead of the 5-second
> timeout.

Yeah... That's how journaling works. Thanks.

> > Do you think DEPT shouldn't track the timeout APIs? If I'm wrong, I'll
> > stop tracking them.
> 
> DEPT tracking timeouts will cause false positives in at least some
> cases.  At the very least, there needs to be an easy way to suppress
> these false positives on a per wait/mutex/spinlock basis.

The easy way is to stop tracking the waits that come with a timeout,
until DEPT starts to model the timeout functionality itself.

Thanks.

	Byungchul
> 
>       	       	    	     	      	   	 - Ted
Byungchul Park May 23, 2022, 2:43 a.m. UTC | #26
On Thu, May 19, 2022 at 11:11:10AM +0100, Catalin Marinas wrote:
> On Wed, May 11, 2022 at 07:04:51PM +0900, Hyeonggon Yoo wrote:
> > On Wed, May 11, 2022 at 08:39:29AM +0900, Byungchul Park wrote:
> > > On Tue, May 10, 2022 at 08:18:12PM +0900, Hyeonggon Yoo wrote:
> > > > On Mon, May 09, 2022 at 09:16:37AM +0900, Byungchul Park wrote:
> > > > > CASE 1.
> > > > > 
> > > > >    lock L with depth n
> > > > >    lock_nested L' with depth n + 1
> > > > >    ...
> > > > >    unlock L'
> > > > >    unlock L
> > > > > 
> > > > > This case is allowed by Lockdep.
> > > > > This case is allowed by DEPT because it's not a deadlock.
> > > > > 
> > > > > CASE 2.
> > > > > 
> > > > >    lock L with depth n
> > > > >    lock A
> > > > >    lock_nested L' with depth n + 1
> > > > >    ...
> > > > >    unlock L'
> > > > >    unlock A
> > > > >    unlock L
> > > > > 
> > > > > This case is allowed by Lockdep.
> > > > > This case is *NOT* allowed by DEPT because it's a *DEADLOCK*.
> > > > 
> > > > Yeah, in previous threads we discussed this [1]
> > > > 
> > > > And the case was:
> > > > 	scan_mutex -> object_lock -> kmemleak_lock -> object_lock
> > > > And dept reported:
> > > > 	object_lock -> kmemleak_lock, kmemleak_lock -> object_lock as
> > > > 	deadlock.
> > > > 
> > > > But IIUC - What DEPT reported happens only under scan_mutex and it
> > > > is not simple just not to take them because the object can be
> > > > removed from the list and freed while scanning via kmemleak_free()
> > > > without kmemleak_lock and object_lock.
> 
> The above kmemleak sequence shouldn't deadlock since those locks, even
> if taken in a different order, are serialised by scan_mutex. For various
> reasons, trying to reduce the latency, I ended up with some
> fine-grained, per-object locking.

I understand why you introduced the fine-grained locks. However, the
reversed order should be avoided anyway. As Steven said, Lockdep should
also have detected this case - that is, it would have been detected if
Lockdep worked correctly.

Making the tool skip a reversed order when it's already protected by
another lock is not a technical issue. But because each lock has its own
purpose, as you explained, nobody knows whether cases might arise
someday that take kmemleak_lock and object_lock without holding
scan_mutex.

I'm wondering how other folks think this case should be handled, though.

> For object allocation (rbtree modification) and tree search, we use
> kmemleak_lock. During scanning (which can take minutes under
> scan_mutex), we want to prevent (a) long latencies and (b) freeing the
> object being scanned. We release the locks regularly for (a) and hold
> the object->lock for (b).
> 
> In another thread Byungchul mentioned:
> 
> |    context X			context Y
> | 
> |    lock mutex A		lock mutex A
> |    lock B			lock C
> |    lock C			lock B
> |    unlock C			unlock B
> |    unlock B			unlock C
> |    unlock mutex A		unlock mutex A
> | 
> | In my opinion, lock B and lock C are unnecessary if they are always
> | along with lock mutex A. Or we should keep correct lock order across all
> | the code.
> 
> If these are the only two places, yes, locks B and C would be
> unnecessary. But we have those locks acquired (not nested) on the
> allocation path (kmemleak_lock) and freeing path (object->lock). We
> don't want to block those paths while scan_mutex is held.
> 
> That said, we may be able to use a single kmemleak_lock for everything.
> The object freeing path may be affected slightly during scanning but the
> code does release it every MAX_SCAN_SIZE bytes. It may even get slightly
> faster as we'd hammer a single lock (I'll do some benchmarks).
> 
> But from a correctness perspective, I think the DEPT tool should be
> improved a bit to detect when such out of order locking is serialised by
> an enclosing lock/mutex.

Again, I don't think this is a technical issue.

	Byungchul
> 
> -- 
> Catalin