[0/4] change sb_writers to use percpu_rw_semaphore
diff mbox

Message ID 20150807195552.GB28529@redhat.com
State New
Headers show

Commit Message

Oleg Nesterov Aug. 7, 2015, 7:55 p.m. UTC
Jan, Dave, please help.

I'll try to re-check/re-test, but so far I think that this and the
previous series are correct. However 3/4 from the previous series
(attached at the end just in case) uncovers (I think) some problems
in xfs locking.

What should I do now?

On 07/22, Oleg Nesterov wrote:
>
> Testing. Well, so far I only verified that ioctl(FIFREEZE) +
> ioctl(FITHAW) seems to wors "as expected" on my testing machine
> with ext3. So probably this needs more testing.

Finally I got the testing machine and ran xfstests, I did

	dd if=/dev/zero of=TEST.img bs=1MiB count=4000
	dd if=/dev/zero of=SCRATCH.img bs=1MiB count=4000

	losetup --find --show TEST.img
	losetup --find --show SCRATCH.img

	mkfs.xfs -f /dev/loop0
	mkfs.xfs -f /dev/loop1

	mkdir -p TEST SCRATCH

	mount /dev/loop0 TEST
	mount /dev/loop1 SCRATCH

	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
	./check `grep -il freeze tests/*/???`

several times and every time the result looks like below:

	FSTYP         -- xfs (non-debug)
	PLATFORM      -- Linux/x86_64 intel-canoepass-10 4.1.0+
	MKFS_OPTIONS  -- -f -bsize=4096 /dev/loop1
	MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH

	generic/068 59s ... 61s
	generic/085 23s ... 26s
	generic/280 2s ... 3s
	generic/311 169s ... 167s
	xfs/011 21s ... 20s
	xfs/119 32s ... 21s
	xfs/297 455s ... 301s
	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
	Passed all 7 tests

But with CONFIG_LOCKDEP=y generic/068 triggers the warning:

	[   66.092831] ======================================================
	[   66.099726] [ INFO: possible circular locking dependency detected ]
	[   66.106719] 4.1.0+ #2 Not tainted
	[   66.110414] -------------------------------------------------------
	[   66.117405] fsstress/2210 is trying to acquire lock:
	[   66.122944]  (&mm->mmap_sem){++++++}, at: [<ffffffff81200562>] might_fault+0x42/0xa0
	[   66.131637]
		       but task is already holding lock:
	[   66.138146]  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
	[   66.148022]
		       which lock already depends on the new lock.

	[   66.157141]
		       the existing dependency chain (in reverse order) is:
	[   66.165490]
		       -> #4 (&xfs_dir_ilock_class){++++..}:
	[   66.170974]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.177596]        [<ffffffff810f4bbc>] down_write_nested+0x3c/0x70
	[   66.184605]        [<ffffffffa057dab6>] xfs_ilock+0x126/0x170 [xfs]
	[   66.191645]        [<ffffffffa057c4da>] xfs_setattr_nonsize+0x3ba/0x5d0 [xfs]
	[   66.199638]        [<ffffffffa057cb5a>] xfs_vn_setattr+0x9a/0xd0 [xfs]
	[   66.206950]        [<ffffffff81279411>] notify_change+0x271/0x3a0
	[   66.213762]        [<ffffffff8125391b>] chown_common.isra.11+0x15b/0x200
	[   66.221251]        [<ffffffff812551a6>] SyS_lchown+0xa6/0xf0
	[   66.227576]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
	[   66.234878]
		       -> #3 (sb_internal#2){++++++}:
	[   66.239698]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.246316]        [<ffffffff81824256>] down_write+0x36/0x70
	[   66.252644]        [<ffffffff81100979>] percpu_down_write+0x39/0x110
	[   66.259760]        [<ffffffff8125901d>] freeze_super+0xbd/0x190
	[   66.266369]        [<ffffffff8126dbc8>] do_vfs_ioctl+0x3f8/0x520
	[   66.273082]        [<ffffffff8126dd71>] SyS_ioctl+0x81/0xa0
	[   66.279311]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
	[   66.286610]
		       -> #2 (sb_pagefaults#2){++++++}:
	[   66.291623]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.298237]        [<ffffffff811007c1>] percpu_down_read+0x51/0xa0
	[   66.305144]        [<ffffffff8125979b>] __sb_start_write+0xdb/0x120
	[   66.312140]        [<ffffffff81295eda>] block_page_mkwrite+0x3a/0xb0
	[   66.319245]        [<ffffffffa057046e>] xfs_filemap_page_mkwrite+0x5e/0xb0 [xfs]
	[   66.327533]        [<ffffffff812009a8>] do_page_mkwrite+0x58/0x100
	[   66.334433]        [<ffffffff81205497>] handle_mm_fault+0x537/0x17c0
	[   66.341533]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
	[   66.348542]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
	[   66.355172]        [<ffffffff818286e8>] page_fault+0x28/0x30
	[   66.361493]
		       -> #1 (&(&ip->i_mmaplock)->mr_lock){++++++}:
	[   66.367659]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.374275]        [<ffffffff810f4aef>] down_read_nested+0x3f/0x60
	[   66.381185]        [<ffffffffa057daf8>] xfs_ilock+0x168/0x170 [xfs]
	[   66.388212]        [<ffffffffa057050c>] xfs_filemap_fault+0x4c/0xb0 [xfs]
	[   66.395817]        [<ffffffff81200a9e>] __do_fault+0x4e/0x100
	[   66.402239]        [<ffffffff81205454>] handle_mm_fault+0x4f4/0x17c0
	[   66.409340]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
	[   66.416344]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
	[   66.422959]        [<ffffffff818286e8>] page_fault+0x28/0x30
	[   66.429283]
		       -> #0 (&mm->mmap_sem){++++++}:
	[   66.434093]        [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
	[   66.441194]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.447808]        [<ffffffff8120058f>] might_fault+0x6f/0xa0
	[   66.454235]        [<ffffffff8126e05a>] filldir+0x9a/0x130
	[   66.460373]        [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
	[   66.469233]        [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
	[   66.476449]        [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
	[   66.483954]        [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
	[   66.490473]        [<ffffffff8126e361>] SyS_getdents+0x91/0x120
	[   66.497091]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
	[   66.504381]
		       other info that might help us debug this:

	[   66.513316] Chain exists of:
			 &mm->mmap_sem --> sb_internal#2 --> &xfs_dir_ilock_class

	[   66.522619]  Possible unsafe locking scenario:

	[   66.529222]        CPU0                    CPU1
	[   66.534275]        ----                    ----
	[   66.539327]   lock(&xfs_dir_ilock_class);
	[   66.543823]                                lock(sb_internal#2);
	[   66.550465]                                lock(&xfs_dir_ilock_class);
	[   66.557767]   lock(&mm->mmap_sem);
	[   66.561580]
			*** DEADLOCK ***

	[   66.568186] 2 locks held by fsstress/2210:
	[   66.572753]  #0:  (&type->i_mutex_dir_key#5){+.+.+.}, at: [<ffffffff8126ddf1>] iterate_dir+0x61/0x140
	[   66.583103]  #1:  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
	[   66.593457]
		       stack backtrace:
	[   66.598321] CPU: 26 PID: 2210 Comm: fsstress Not tainted 4.1.0+ #2
	[   66.605215] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
	[   66.616372]  0000000000000000 0000000048bb9c77 ffff8817fa127ad8 ffffffff8181d9dd
	[   66.624663]  0000000000000000 ffffffff8288f500 ffff8817fa127b28 ffffffff810f7b26
	[   66.632955]  0000000000000002 ffff8817fa127b98 ffff8817fa127b28 ffff881803f06480
	[   66.641249] Call Trace:
	[   66.643983]  [<ffffffff8181d9dd>] dump_stack+0x45/0x57
	[   66.649713]  [<ffffffff810f7b26>] print_circular_bug+0x206/0x280
	[   66.656413]  [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
	[   66.662919]  [<ffffffff810fa1b7>] ? __lock_acquire+0xa27/0x2100
	[   66.669523]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
	[   66.675543]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
	[   66.681566]  [<ffffffff8120058f>] might_fault+0x6f/0xa0
	[   66.687394]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
	[   66.693418]  [<ffffffff8126e05a>] filldir+0x9a/0x130
	[   66.698968]  [<ffffffffa057db34>] ? xfs_ilock_data_map_shared+0x34/0x40 [xfs]
	[   66.706941]  [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
	[   66.715203]  [<ffffffffa057da52>] ? xfs_ilock+0xc2/0x170 [xfs]
	[   66.721712]  [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
	[   66.728316]  [<ffffffff81822b6d>] ? mutex_lock_killable_nested+0x3ad/0x480
	[   66.735998]  [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
	[   66.742894]  [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
	[   66.748819]  [<ffffffff8127a1ac>] ? __fget_light+0x6c/0xa0
	[   66.754938]  [<ffffffff8126e361>] SyS_getdents+0x91/0x120
	[   66.760958]  [<ffffffff8126dfc0>] ? fillonedir+0xf0/0xf0
	[   66.766884]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76

The chain reported by lockdep is not exactly the same every time,
but similar.

Once again, I'll recheck. but the patch below still looks correct
to me, and I think that it is actually a fix.

Before this patch freeze_super() calls sync_filesystem() and
s_op->freeze_fs(sb) without ->s_writers locks (as it seen by
lockdep) and this is wrong.

After this patch lockdep knows about ->s_writers locks held and this
is what we want, but apparently this way lockdep can notice the new
dependencies.

Oleg.

------------------------------------------------------------------------------
Subject: [PATCH 3/4] move rwsem_release() from sb_wait_write() to freeze_super()

Move the "fool the lockdep" code from sb_wait_write() into the new
simple helper, sb_lockdep_release(), called once before return from
freeze_super().

This is preparation, but imo this makes sense in any case. This way
we do not hide from lockdep the "write" locks we hold when we call
s_op->freeze_fs(sb).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c |   16 ++++++++++------
 1 files changed, 10 insertions(+), 6 deletions(-)

Comments

Jan Kara Aug. 10, 2015, 2:59 p.m. UTC | #1
Hello,

On Fri 07-08-15 21:55:52, Oleg Nesterov wrote:
> I'll try to re-check/re-test, but so far I think that this and the
> previous series are correct. However 3/4 from the previous series
> (attached at the end just in case) uncovers (I think) some problems
> in xfs locking.
> 
> What should I do now?
> 
> On 07/22, Oleg Nesterov wrote:
> >
> > Testing. Well, so far I only verified that ioctl(FIFREEZE) +
> > ioctl(FITHAW) seems to wors "as expected" on my testing machine
> > with ext3. So probably this needs more testing.
> 
> Finally I got the testing machine and ran xfstests, I did
> 
> 	dd if=/dev/zero of=TEST.img bs=1MiB count=4000
> 	dd if=/dev/zero of=SCRATCH.img bs=1MiB count=4000
> 
> 	losetup --find --show TEST.img
> 	losetup --find --show SCRATCH.img
> 
> 	mkfs.xfs -f /dev/loop0
> 	mkfs.xfs -f /dev/loop1
> 
> 	mkdir -p TEST SCRATCH
> 
> 	mount /dev/loop0 TEST
> 	mount /dev/loop1 SCRATCH
> 
> 	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
> 	./check `grep -il freeze tests/*/???`
> 
> several times and every time the result looks like below:
> 
> 	FSTYP         -- xfs (non-debug)
> 	PLATFORM      -- Linux/x86_64 intel-canoepass-10 4.1.0+
> 	MKFS_OPTIONS  -- -f -bsize=4096 /dev/loop1
> 	MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/loop1 SCRATCH
> 
> 	generic/068 59s ... 61s
> 	generic/085 23s ... 26s
> 	generic/280 2s ... 3s
> 	generic/311 169s ... 167s
> 	xfs/011 21s ... 20s
> 	xfs/119 32s ... 21s
> 	xfs/297 455s ... 301s
> 	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
> 	Passed all 7 tests

Hum, I had a look at the lockdep report below and it's one of the
peculiarities of the freeze protection. For record let me repeat the full
argument for why I don't think there's a possibility for a real deadlock.
Feel free to skip to the end if you get bored.  

One would like to construct the lock chain as:

CPU0 (chown foo dir)	CPU1 (readdir dir)	CPU2 (page fault)
process Y		process X, thread 0	process X, thread 1

			get ILOCK for dir
gets freeze protection
starts transaction in xfs_setattr_nonsize
waits to get ILOCK on 'dir'
						get mmap_sem for X
			wait for mmap_sem for process X
			  in filldir()
						wait for freeze protection in
						  xfs_page_mkwrite

and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
finish it's freeze-protected section. But this cannot happen. The reason is
that we block writers level-by-level and thus while there are writers at
level X, we do not block writers at level X+1. So in this particular case
freeze_super() will block waiting for CPU0 to finish its freeze protected
section while CPU2 is free to continue.

In general we have a chain like

freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
                A                                          |
                \------------------------------------------/

But since ILOCK is always acquired with freeze protection at L0 and we can
block at L1 only after there are no writers at L0, this loop can never
happen.

Note that if we use the property of freezing that lock at level X+1 cannot
block when we hold lock at level X, we can as well simplify the dependency
graph and track in it only the lowest level of freeze lock that is
currently acquired (since the levels above it cannot block and do not in
any way influence blocking of other processes either and thus are
irrelevant for the purpose of deadlock detection). Then the dependency
graph we'd get would be:

freeze L0 -> ILOCK -> mmap_sem -> freeze L1

and we have a nice acyclic graph we like to see... So probably we have to
hack the lockdep instrumentation some more and just don't tell lockdep
about freeze locks at higher levels if we already hold a lock at lower
level. Thoughts?

								Honza

> But with CONFIG_LOCKDEP=y generic/068 triggers the warning:
> 
> 	[   66.092831] ======================================================
> 	[   66.099726] [ INFO: possible circular locking dependency detected ]
> 	[   66.106719] 4.1.0+ #2 Not tainted
> 	[   66.110414] -------------------------------------------------------
> 	[   66.117405] fsstress/2210 is trying to acquire lock:
> 	[   66.122944]  (&mm->mmap_sem){++++++}, at: [<ffffffff81200562>] might_fault+0x42/0xa0
> 	[   66.131637]
> 		       but task is already holding lock:
> 	[   66.138146]  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
> 	[   66.148022]
> 		       which lock already depends on the new lock.
> 
> 	[   66.157141]
> 		       the existing dependency chain (in reverse order) is:
> 	[   66.165490]
> 		       -> #4 (&xfs_dir_ilock_class){++++..}:
> 	[   66.170974]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.177596]        [<ffffffff810f4bbc>] down_write_nested+0x3c/0x70
> 	[   66.184605]        [<ffffffffa057dab6>] xfs_ilock+0x126/0x170 [xfs]
> 	[   66.191645]        [<ffffffffa057c4da>] xfs_setattr_nonsize+0x3ba/0x5d0 [xfs]
> 	[   66.199638]        [<ffffffffa057cb5a>] xfs_vn_setattr+0x9a/0xd0 [xfs]
> 	[   66.206950]        [<ffffffff81279411>] notify_change+0x271/0x3a0
> 	[   66.213762]        [<ffffffff8125391b>] chown_common.isra.11+0x15b/0x200
> 	[   66.221251]        [<ffffffff812551a6>] SyS_lchown+0xa6/0xf0
> 	[   66.227576]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> 	[   66.234878]
> 		       -> #3 (sb_internal#2){++++++}:
> 	[   66.239698]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.246316]        [<ffffffff81824256>] down_write+0x36/0x70
> 	[   66.252644]        [<ffffffff81100979>] percpu_down_write+0x39/0x110
> 	[   66.259760]        [<ffffffff8125901d>] freeze_super+0xbd/0x190
> 	[   66.266369]        [<ffffffff8126dbc8>] do_vfs_ioctl+0x3f8/0x520
> 	[   66.273082]        [<ffffffff8126dd71>] SyS_ioctl+0x81/0xa0
> 	[   66.279311]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> 	[   66.286610]
> 		       -> #2 (sb_pagefaults#2){++++++}:
> 	[   66.291623]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.298237]        [<ffffffff811007c1>] percpu_down_read+0x51/0xa0
> 	[   66.305144]        [<ffffffff8125979b>] __sb_start_write+0xdb/0x120
> 	[   66.312140]        [<ffffffff81295eda>] block_page_mkwrite+0x3a/0xb0
> 	[   66.319245]        [<ffffffffa057046e>] xfs_filemap_page_mkwrite+0x5e/0xb0 [xfs]
> 	[   66.327533]        [<ffffffff812009a8>] do_page_mkwrite+0x58/0x100
> 	[   66.334433]        [<ffffffff81205497>] handle_mm_fault+0x537/0x17c0
> 	[   66.341533]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
> 	[   66.348542]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
> 	[   66.355172]        [<ffffffff818286e8>] page_fault+0x28/0x30
> 	[   66.361493]
> 		       -> #1 (&(&ip->i_mmaplock)->mr_lock){++++++}:
> 	[   66.367659]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.374275]        [<ffffffff810f4aef>] down_read_nested+0x3f/0x60
> 	[   66.381185]        [<ffffffffa057daf8>] xfs_ilock+0x168/0x170 [xfs]
> 	[   66.388212]        [<ffffffffa057050c>] xfs_filemap_fault+0x4c/0xb0 [xfs]
> 	[   66.395817]        [<ffffffff81200a9e>] __do_fault+0x4e/0x100
> 	[   66.402239]        [<ffffffff81205454>] handle_mm_fault+0x4f4/0x17c0
> 	[   66.409340]        [<ffffffff8106a03c>] __do_page_fault+0x19c/0x450
> 	[   66.416344]        [<ffffffff8106a31f>] do_page_fault+0x2f/0x80
> 	[   66.422959]        [<ffffffff818286e8>] page_fault+0x28/0x30
> 	[   66.429283]
> 		       -> #0 (&mm->mmap_sem){++++++}:
> 	[   66.434093]        [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
> 	[   66.441194]        [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.447808]        [<ffffffff8120058f>] might_fault+0x6f/0xa0
> 	[   66.454235]        [<ffffffff8126e05a>] filldir+0x9a/0x130
> 	[   66.460373]        [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
> 	[   66.469233]        [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
> 	[   66.476449]        [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
> 	[   66.483954]        [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
> 	[   66.490473]        [<ffffffff8126e361>] SyS_getdents+0x91/0x120
> 	[   66.497091]        [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> 	[   66.504381]
> 		       other info that might help us debug this:
> 
> 	[   66.513316] Chain exists of:
> 			 &mm->mmap_sem --> sb_internal#2 --> &xfs_dir_ilock_class
> 
> 	[   66.522619]  Possible unsafe locking scenario:
> 
> 	[   66.529222]        CPU0                    CPU1
> 	[   66.534275]        ----                    ----
> 	[   66.539327]   lock(&xfs_dir_ilock_class);
> 	[   66.543823]                                lock(sb_internal#2);
> 	[   66.550465]                                lock(&xfs_dir_ilock_class);
> 	[   66.557767]   lock(&mm->mmap_sem);
> 	[   66.561580]
> 			*** DEADLOCK ***
> 
> 	[   66.568186] 2 locks held by fsstress/2210:
> 	[   66.572753]  #0:  (&type->i_mutex_dir_key#5){+.+.+.}, at: [<ffffffff8126ddf1>] iterate_dir+0x61/0x140
> 	[   66.583103]  #1:  (&xfs_dir_ilock_class){++++..}, at: [<ffffffffa057da52>] xfs_ilock+0xc2/0x170 [xfs]
> 	[   66.593457]
> 		       stack backtrace:
> 	[   66.598321] CPU: 26 PID: 2210 Comm: fsstress Not tainted 4.1.0+ #2
> 	[   66.605215] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
> 	[   66.616372]  0000000000000000 0000000048bb9c77 ffff8817fa127ad8 ffffffff8181d9dd
> 	[   66.624663]  0000000000000000 ffffffff8288f500 ffff8817fa127b28 ffffffff810f7b26
> 	[   66.632955]  0000000000000002 ffff8817fa127b98 ffff8817fa127b28 ffff881803f06480
> 	[   66.641249] Call Trace:
> 	[   66.643983]  [<ffffffff8181d9dd>] dump_stack+0x45/0x57
> 	[   66.649713]  [<ffffffff810f7b26>] print_circular_bug+0x206/0x280
> 	[   66.656413]  [<ffffffff810fb867>] __lock_acquire+0x20d7/0x2100
> 	[   66.662919]  [<ffffffff810fa1b7>] ? __lock_acquire+0xa27/0x2100
> 	[   66.669523]  [<ffffffff810fc94e>] lock_acquire+0xbe/0x150
> 	[   66.675543]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
> 	[   66.681566]  [<ffffffff8120058f>] might_fault+0x6f/0xa0
> 	[   66.687394]  [<ffffffff81200562>] ? might_fault+0x42/0xa0
> 	[   66.693418]  [<ffffffff8126e05a>] filldir+0x9a/0x130
> 	[   66.698968]  [<ffffffffa057db34>] ? xfs_ilock_data_map_shared+0x34/0x40 [xfs]
> 	[   66.706941]  [<ffffffffa056daed>] xfs_dir2_sf_getdents.isra.10+0x1bd/0x220 [xfs]
> 	[   66.715203]  [<ffffffffa057da52>] ? xfs_ilock+0xc2/0x170 [xfs]
> 	[   66.721712]  [<ffffffffa056e72e>] xfs_readdir+0x17e/0x1a0 [xfs]
> 	[   66.728316]  [<ffffffff81822b6d>] ? mutex_lock_killable_nested+0x3ad/0x480
> 	[   66.735998]  [<ffffffffa057095b>] xfs_file_readdir+0x2b/0x30 [xfs]
> 	[   66.742894]  [<ffffffff8126de2a>] iterate_dir+0x9a/0x140
> 	[   66.748819]  [<ffffffff8127a1ac>] ? __fget_light+0x6c/0xa0
> 	[   66.754938]  [<ffffffff8126e361>] SyS_getdents+0x91/0x120
> 	[   66.760958]  [<ffffffff8126dfc0>] ? fillonedir+0xf0/0xf0
> 	[   66.766884]  [<ffffffff8182662e>] system_call_fastpath+0x12/0x76
> 
> The chain reported by lockdep is not exactly the same every time,
> but similar.
> 
> Once again, I'll recheck. but the patch below still looks correct
> to me, and I think that it is actually a fix.
> 
> Before this patch freeze_super() calls sync_filesystem() and
> s_op->freeze_fs(sb) without ->s_writers locks (as it seen by
> lockdep) and this is wrong.
> 
> After this patch lockdep knows about ->s_writers locks held and this
> is what we want, but apparently this way lockdep can notice the new
> dependencies.
> 
> Oleg.
> 
> ------------------------------------------------------------------------------
> Subject: [PATCH 3/4] move rwsem_release() from sb_wait_write() to freeze_super()
> 
> Move the "fool the lockdep" code from sb_wait_write() into the new
> simple helper, sb_lockdep_release(), called once before return from
> freeze_super().
> 
> This is preparation, but imo this makes sense in any case. This way
> we do not hide from lockdep the "write" locks we hold when we call
> s_op->freeze_fs(sb).
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  fs/super.c |   16 ++++++++++------
>  1 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index d0fdd49..e7ea9f1 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1236,16 +1236,10 @@ static void sb_wait_write(struct super_block *sb, int level)
>  {
>  	s64 writers;
>  
> -	/*
> -	 * We just cycle-through lockdep here so that it does not complain
> -	 * about returning with lock to userspace
> -	 */
>  	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
> -	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
>  
>  	do {
>  		DEFINE_WAIT(wait);
> -
>  		/*
>  		 * We use a barrier in prepare_to_wait() to separate setting
>  		 * of frozen and checking of the counter
> @@ -1261,6 +1255,14 @@ static void sb_wait_write(struct super_block *sb, int level)
>  	} while (writers);
>  }
>  
> +static void sb_freeze_release(struct super_block *sb)
> +{
> +	int level;
> +	/* Avoid the warning from lockdep_sys_exit() */
> +	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> +		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
> +}
> +
>  /**
>   * freeze_super - lock the filesystem and force it into a consistent state
>   * @sb: the super to lock
> @@ -1349,6 +1351,7 @@ int freeze_super(struct super_block *sb)
>  			sb->s_writers.frozen = SB_UNFROZEN;
>  			smp_wmb();
>  			wake_up(&sb->s_writers.wait_unfrozen);
> +			sb_freeze_release(sb);
>  			deactivate_locked_super(sb);
>  			return ret;
>  		}
> @@ -1358,6 +1361,7 @@ int freeze_super(struct super_block *sb)
>  	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
>  	 */
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> +	sb_freeze_release(sb);
>  	up_write(&sb->s_umount);
>  	return 0;
>  }
> -- 
> 1.5.5.1
> 
>
Dave Chinner Aug. 10, 2015, 10:41 p.m. UTC | #2
On Mon, Aug 10, 2015 at 04:59:42PM +0200, Jan Kara wrote:
> On Fri 07-08-15 21:55:52, Oleg Nesterov wrote:
> > I'll try to re-check/re-test, but so far I think that this and the
> > previous series are correct. However 3/4 from the previous series
> > (attached at the end just in case) uncovers (I think) some problems
> > in xfs locking.
....
> 
> Hum, I had a look at the lockdep report below and it's one of the
> peculiarities of the freeze protection. For record let me repeat the full
> argument for why I don't think there's a possibility for a real deadlock.
> Feel free to skip to the end if you get bored.  
> 
> One would like to construct the lock chain as:
> 
> CPU0 (chown foo dir)	CPU1 (readdir dir)	CPU2 (page fault)
> process Y		process X, thread 0	process X, thread 1
> 
> 			get ILOCK for dir
> gets freeze protection
> starts transaction in xfs_setattr_nonsize
> waits to get ILOCK on 'dir'
> 						get mmap_sem for X
> 			wait for mmap_sem for process X
> 			  in filldir()
> 						wait for freeze protection in
> 						  xfs_page_mkwrite
> 
> and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
> finish it's freeze-protected section. But this cannot happen. The reason is
> that we block writers level-by-level and thus while there are writers at
> level X, we do not block writers at level X+1. So in this particular case
> freeze_super() will block waiting for CPU0 to finish its freeze protected
> section while CPU2 is free to continue.
> 
> In general we have a chain like
> 
> freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
>                 A                                          |
>                 \------------------------------------------/
> 
> But since ILOCK is always acquired with freeze protection at L0 and we can
> block at L1 only after there are no writers at L0, this loop can never
> happen.
> 
> Note that if we use the property of freezing that lock at level X+1 cannot
> block when we hold lock at level X, we can as well simplify the dependency
> graph and track in it only the lowest level of freeze lock that is
> currently acquired (since the levels above it cannot block and do not in
> any way influence blocking of other processes either and thus are
> irrelevant for the purpose of deadlock detection). Then the dependency
> graph we'd get would be:
> 
> freeze L0 -> ILOCK -> mmap_sem -> freeze L1
> 
> and we have a nice acyclic graph we like to see... So probably we have to
> hack the lockdep instrumentation some more and just don't tell lockdep
> about freeze locks at higher levels if we already hold a lock at lower
> level. Thoughts?

The XFS directory ilock->filldir->might_fault locking path has been
generating false positives in quite a lot of places because of
things we do on one side of the mmap_sem in filesystem paths vs
thigs we do on the other side of the mmap_sem in the page fault
path.

I'm getting sick of these stupid lockdep false positives. I think I
need to rework the XFS readdir locking patch I wrote a while back:

http://oss.sgi.com/archives/xfs/2014-03/msg00146.html

At the time it wasn't clear what the best way forward was. Right now
I think the method I originally used (IOLOCK for directory data and
hence readdir, ILOCK for everything else) will be sufficient; if we
need to do anything smarter that can be addressed further down the
road.

Cheers,

Dave.

Patch
diff mbox

diff --git a/fs/super.c b/fs/super.c
index d0fdd49..e7ea9f1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1236,16 +1236,10 @@  static void sb_wait_write(struct super_block *sb, int level)
 {
 	s64 writers;
 
-	/*
-	 * We just cycle-through lockdep here so that it does not complain
-	 * about returning with lock to userspace
-	 */
 	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
-	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
 
 	do {
 		DEFINE_WAIT(wait);
-
 		/*
 		 * We use a barrier in prepare_to_wait() to separate setting
 		 * of frozen and checking of the counter
@@ -1261,6 +1255,14 @@  static void sb_wait_write(struct super_block *sb, int level)
 	} while (writers);
 }
 
+static void sb_freeze_release(struct super_block *sb)
+{
+	int level;
+	/* Avoid the warning from lockdep_sys_exit() */
+	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+		rwsem_release(sb->s_writers.lock_map + level, 1, _THIS_IP_);
+}
+
 /**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
@@ -1349,6 +1351,7 @@  int freeze_super(struct super_block *sb)
 			sb->s_writers.frozen = SB_UNFROZEN;
 			smp_wmb();
 			wake_up(&sb->s_writers.wait_unfrozen);
+			sb_freeze_release(sb);
 			deactivate_locked_super(sb);
 			return ret;
 		}
@@ -1358,6 +1361,7 @@  int freeze_super(struct super_block *sb)
 	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+	sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }