[V2,2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths

Message ID 20160926165525.GA9338@redhat.com (mailing list archive)
State New, archived

Commit Message

Oleg Nesterov Sept. 26, 2016, 4:55 p.m. UTC
On 09/26, Jan Kara wrote:
>
> On Mon 26-09-16 18:08:06, Oleg Nesterov wrote:
> > +/*
> > + * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
> > + */
> > +static void sb_freeze_acquire(struct super_block *sb)
>
> Can we call this lockdep_sb_freeze_acquire() or something like that so that
> it is clear this is only about lockdep annotations? Similarly with
> sb_freeze_unlock()...

OK, thanks, done. See V2 below.

> and I hope you really tested
> there are no more lockdep false positives ;).

Heh ;) if only I knew how to test this... I ran the following script under qemu

	mkfs.xfs -f /dev/vda
	mkfs.xfs -f /dev/vdb

	mkdir -p TEST SCRATCH

	TEST_DEV=/dev/vda TEST_DIR=TEST SCRATCH_DEV=/dev/vdb SCRATCH_MNT=SCRATCH \
	./check `grep -il freeze tests/*/???`

this time.

And yes, I'm afraid this change can uncover some false positives later. But at
the same time it can potentially find real problems.

It would be nice to remove another hack in __sb_start_write under ifdef(CONFIG_LOCKDEP),
but iirc XFS actually takes the same rw_sem twice for reading, so we can't do this.
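
For reference, the hack in question looks roughly like this (paraphrased
from the fs/super.c of this era; comments mine, so treat it as a sketch):

	int __sb_start_write(struct super_block *sb, int level, bool wait)
	{
		bool force_trylock = false;
		int ret = 1;

	#ifdef CONFIG_LOCKDEP
		/*
		 * XFS can take the same freeze-protection rwsem twice for
		 * reading, which is safe only because the caller already
		 * holds a higher freeze level. Degrade to trylock mode so
		 * that lockdep does not record a false recursive-read
		 * deadlock.
		 */
		if (wait) {
			int i;

			for (i = 0; i < level - 1; i++)
				if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
					force_trylock = true;
					break;
				}
		}
	#endif
		if (wait && !force_trylock)
			percpu_down_read(sb->s_writers.rw_sem + level - 1);
		else
			ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level - 1);

		WARN_ON(force_trylock && !ret);
		return ret;
	}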

-------------------------------------------------------------------------------
From 70e774533ab6319f9b90a490b036150ad9604af7 Mon Sep 17 00:00:00 2001
From: Oleg Nesterov <oleg@redhat.com>
Date: Mon, 26 Sep 2016 17:23:24 +0200
Subject: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths

sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid false
positives. Now that xfs was fixed by Dave's commit dbad7c993053
("xfs: stop holding ILOCK over filldir callbacks") we can remove it and
change freeze_super() and thaw_super() to run with s_writers.rw_sem locks
held; we add two trivial helpers for that, lockdep_sb_freeze_release()
and lockdep_sb_freeze_acquire().

xfstests-dev/check `grep -il freeze tests/*/???` does not trigger any
warning from lockdep.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 fs/super.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

Comments

Jan Kara Sept. 27, 2016, 6:51 a.m. UTC | #1
On Mon 26-09-16 18:55:25, Oleg Nesterov wrote:
> On 09/26, Jan Kara wrote:
> >
> > On Mon 26-09-16 18:08:06, Oleg Nesterov wrote:
> > > +/*
> > > + * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
> > > + */
> > > +static void sb_freeze_acquire(struct super_block *sb)
> >
> > Can we call this lockdep_sb_freeze_acquire() or something like that so that
> > it is clear this is only about lockdep annotations? Similarly with
> > sb_freeze_unlock()...
> 
> OK, thanks, done. See V2 below.
> 
> > and I hope you really tested
> > there are no more lockdep false positives ;).
> 
> Heh ;) if only I knew how to test this... I ran the following script
> under qemu
> 
> 	mkfs.xfs -f /dev/vda
> 	mkfs.xfs -f /dev/vdb
> 
> 	mkdir -p TEST SCRATCH
> 
> 	TEST_DEV=/dev/vda TEST_DIR=TEST SCRATCH_DEV=/dev/vdb SCRATCH_MNT=SCRATCH \
> 	./check `grep -il freeze tests/*/???`

You can run either:

	./check -g freeze

to check just the freezing tests or

	./check

to run all sensible tests, which is what I'd do (but it will take a couple of
hours to pass). If that passes, chances are good there are no easy false
positives.

> And yes, I'm afraid this change can uncover some false positives later.
> But at the same time it can potentially find real problems.

Well, sure it's not the end of the world if there is some false positive - we
can just revert the change - but lockdep false positives are always
annoying because they take time to analyze and, until they are fixed, you
are unable to see other problems found by lockdep...

> It would be nice to remove another hack in __sb_start_write under
> ifdef(CONFIG_LOCKDEP), but iirc XFS actually takes the same rw_sem twice
> for reading, so we can't do this.

Yes, and I don't really consider this a hack. Filesystem freezing was never
meant to have all the properties of classical rwsems in the kernel - I just
happened to notice that the semantics are close to rwsems and so decided to
model it as such in lockdep to get some lockdep verification for it. In
particular, "read lock recursion" has always been OK with freeze protection
by design and I don't see a strong motivation for changing that.
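
E.g. a nesting like the following (illustrative only, not a real callchain)
has always been fine for freeze protection, while lockdep would flag it as
recursive read locking on an ordinary rwsem:

	sb_start_write(sb);		/* SB_FREEZE_WRITE */
	...
	sb_start_intwrite(sb);		/* SB_FREEZE_FS, nested inside */
	...
	sb_start_intwrite(sb);		/* same level taken again */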

> -------------------------------------------------------------------------------
> From 70e774533ab6319f9b90a490b036150ad9604af7 Mon Sep 17 00:00:00 2001
> From: Oleg Nesterov <oleg@redhat.com>
> Date: Mon, 26 Sep 2016 17:23:24 +0200
> Subject: [PATCH V2 2/2] fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
> 
> sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid false
> positives. Now that xfs was fixed by Dave's commit dbad7c993053
> ("xfs: stop holding ILOCK over filldir callbacks") we can remove it and
> change freeze_super() and thaw_super() to run with s_writers.rw_sem locks
> held; we add two trivial helpers for that, lockdep_sb_freeze_release()
> and lockdep_sb_freeze_acquire().
> 
> xfstests-dev/check `grep -il freeze tests/*/???` does not trigger any
> warning from lockdep.
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Thanks. The patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/super.c | 37 +++++++++++++++++++++++++------------
>  1 file changed, 25 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 2549896c..0447afe 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1214,25 +1214,34 @@ EXPORT_SYMBOL(__sb_start_write);
>  static void sb_wait_write(struct super_block *sb, int level)
>  {
>  	percpu_down_write(sb->s_writers.rw_sem + level-1);
> -	/*
> -	 * We are going to return to userspace and forget about this lock, the
> -	 * ownership goes to the caller of thaw_super() which does unlock.
> -	 *
> -	 * FIXME: we should do this before return from freeze_super() after we
> -	 * called sync_filesystem(sb) and s_op->freeze_fs(sb), and thaw_super()
> -	 * should re-acquire these locks before s_op->unfreeze_fs(sb). However
> -	 * this leads to lockdep false-positives, so currently we do the early
> -	 * release right after acquire.
> -	 */
> -	percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
>  }
>  
> -static void sb_freeze_unlock(struct super_block *sb)
> +/*
> + * We are going to return to userspace and forget about these locks, the
> + * ownership goes to the caller of thaw_super() which does unlock().
> + */
> +static void lockdep_sb_freeze_release(struct super_block *sb)
> +{
> +	int level;
> +
> +	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
> +		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> +}
> +
> +/*
> + * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
> + */
> +static void lockdep_sb_freeze_acquire(struct super_block *sb)
>  {
>  	int level;
>  
>  	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
>  		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
> +}
> +
> +static void sb_freeze_unlock(struct super_block *sb)
> +{
> +	int level;
>  
>  	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
>  		percpu_up_write(sb->s_writers.rw_sem + level);
> @@ -1328,6 +1337,7 @@ int freeze_super(struct super_block *sb)
>  	 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
>  	 */
>  	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
> +	lockdep_sb_freeze_release(sb);
>  	up_write(&sb->s_umount);
>  	return 0;
>  }
> @@ -1354,11 +1364,14 @@ int thaw_super(struct super_block *sb)
>  		goto out;
>  	}
>  
> +	lockdep_sb_freeze_acquire(sb);
> +
>  	if (sb->s_op->unfreeze_fs) {
>  		error = sb->s_op->unfreeze_fs(sb);
>  		if (error) {
>  			printk(KERN_ERR
>  				"VFS:Filesystem thaw failed\n");
> +			lockdep_sb_freeze_release(sb);
>  			up_write(&sb->s_umount);
>  			return error;
>  		}
> -- 
> 2.5.0
> 
>
Dave Chinner Sept. 27, 2016, 7:14 a.m. UTC | #2
On Tue, Sep 27, 2016 at 08:51:35AM +0200, Jan Kara wrote:
> On Mon 26-09-16 18:55:25, Oleg Nesterov wrote:
> > On 09/26, Jan Kara wrote:
> > >
> > > On Mon 26-09-16 18:08:06, Oleg Nesterov wrote:
> > > > +/*
> > > > + * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
> > > > + */
> > > > +static void sb_freeze_acquire(struct super_block *sb)
> > >
> > > Can we call this lockdep_sb_freeze_acquire() or something like that so that
> > > it is clear this is only about lockdep annotations? Similarly with
> > > sb_freeze_unlock()...
> > 
> > OK, thanks, done. See V2 below.
> > 
> > > and I hope you really tested
> > > there are no more lockdep false positives ;).
> > 
> > Heh ;) if only I knew how to test this... I ran the following script
> > under qemu
> > 
> > 	mkfs.xfs -f /dev/vda
> > 	mkfs.xfs -f /dev/vdb
> > 
> > 	mkdir -p TEST SCRATCH
> > 
> > 	TEST_DEV=/dev/vda TEST_DIR=TEST SCRATCH_DEV=/dev/vdb SCRATCH_MNT=SCRATCH \
> > 	./check `grep -il freeze tests/*/???`
> 
> You can run either:
> 
> 	./check -g freeze
> 
> to check just the freezing tests or
> 
> 	./check

Better for regression testing is:

check -g auto

so that it skips all the tests that are broken or likely to crash
the machine on some debug check.

Cheers,

Dave.
Oleg Nesterov Sept. 27, 2016, 5:29 p.m. UTC | #3
On 09/27, Jan Kara wrote:
>
> On Mon 26-09-16 18:55:25, Oleg Nesterov wrote:
> >
> > Heh ;) if only I knew how to test this... I ran the following script
> > under qemu
> >
> > 	mkfs.xfs -f /dev/vda
> > 	mkfs.xfs -f /dev/vdb
> >
> > 	mkdir -p TEST SCRATCH
> >
> > 	TEST_DEV=/dev/vda TEST_DIR=TEST SCRATCH_DEV=/dev/vdb SCRATCH_MNT=SCRATCH \
> > 	./check `grep -il freeze tests/*/???`
>
> You can run either:
>
> 	./check -g freeze

passed all 6 tests.

> to check just the freezing tests or
>
> 	./check
>
> to run all sensible tests, which is what I'd do (but it will take a couple of
> hours to pass). If that passes, chances are good there are no easy false
> positives.

It seems that generic/001 just hangs on my laptop, with or without this change.
Or perhaps I didn't wait long enough... Or perhaps something is wrong with my very
limited testing environment. I'll reserve a testing machine tomorrow.

> > And yes, I'm afraid this change can uncover some false positives later.
> > But at the same time it can potentially find real problems.
>
> Well, sure it's not the end of the world if there is some false positive - we
> can just revert the change - but lockdep false positives are always
> annoying because they take time to analyze and, until they are fixed, you
> are unable to see other problems found by lockdep...

Yes, yes, agreed.

> > It would be nice to remove another hack in __sb_start_write under
> > ifdef(CONFIG_LOCKDEP), but iirc XFS actually takes the same rw_sem twice
> > for reading, so we can't do this.
>
> Yes, and I don't really consider this a hack.

Ah, sorry, I didn't mean to blame XFS/fs. I meant that this "force_trylock"
hack doesn't look nice. Perhaps we can use rwsem_acquire_nest() instead.

> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks!

Oleg.

Oleg Nesterov Sept. 30, 2016, 5:14 p.m. UTC | #4
On 09/27, Oleg Nesterov wrote:
>
> On 09/27, Jan Kara wrote:
> >
> > You can run either:
> >
> > 	./check -g freeze
>
> passed all 6 tests.
>
> > to check just the freezing tests or
> >
> > 	./check
> >
> > to run all sensible tests, which is what I'd do (but it will take a couple of
> > hours to pass). If that passes, chances are good there are no easy false
> > positives.
>
> It seems that generic/001 just hangs on my laptop, with or without this change.
> Or perhaps I didn't wait long enough...

/usr/bin/awk spins in user-mode forever, according to strace it doesn't do
any syscalls. I didn't even try to investigate.

> Or perhaps something is wrong with my very
> limited testing environment. I'll reserve a testing machine tomorrow.

Jan, I gave up.

Whatever I did, xfstests-dev/check reports a lot of failures, kernel bugs,
and finally either crashes the kernel or hangs.

With or without this change, and I didn't notice any warning from lockdep.
So I hope we can apply this patch. At least 1/2, which fixes a bug.

Oleg.

Dave Chinner Oct. 2, 2016, 9:42 p.m. UTC | #5
On Fri, Sep 30, 2016 at 07:14:34PM +0200, Oleg Nesterov wrote:
> On 09/27, Oleg Nesterov wrote:
> >
> > On 09/27, Jan Kara wrote:
> > >
> > > You can run either:
> > >
> > > 	./check -g freeze
> >
> > passed all 6 tests.
> >
> > > to check just the freezing tests or
> > >
> > > 	./check
> > >
> > > to run all sensible tests, which is what I'd do (but it will take a couple of
> > > hours to pass). If that passes, chances are good there are no easy false
> > > positives.
> >
> > It seems that generic/001 just hangs on my laptop, with or without this change.
> > Or perhaps I didn't wait long enough...
> 
> /usr/bin/awk spins in user-mode forever, according to strace it doesn't do
> any syscalls. I didn't even try to investigate.

Are you using gawk, or some other version of awk? gawk doesn't have
any problems that anyone has reported (it is listed as a prereq package
in the README file), and there are a lot of people using it...

> > Or perhaps something is wrong with my very
> > limited testing environment. I'll reserve a testing machine tomorrow.
> 
> Jan, I gave up.
> 
> Whatever I did, xfstests-dev/check reports a lot of failures, kernel bugs,
> and finally either crashes the kernel or hangs.

Did you run the auto group like I suggested? That's the set of tests
that should complete successfully with minimal failures and without
crashing the machine. If you're running this group and there's
failures, hangs and crashes all over the place, then you need to
start reporting bugs because that should not be happening on any
kernel....

Cheers,

Dave.
Oleg Nesterov Oct. 3, 2016, 4:44 p.m. UTC | #6
On 10/03, Dave Chinner wrote:
>
> On Fri, Sep 30, 2016 at 07:14:34PM +0200, Oleg Nesterov wrote:
> > On 09/27, Oleg Nesterov wrote:
> > >
> > >
> > > It seems that generic/001 just hangs on my laptop, with or without this change.
> > > Or perhaps I didn't wait long enough...
> >
> > /usr/bin/awk spins in user-mode forever, according to strace it doesn't do
> > any syscalls. I didn't even try to investigate.
>
> Are you using gawk, or some other version of awk? gawk doesn't have
> any problems

gawk; awk is a symlink. And somehow this depends on the userspace environment:
it doesn't hang on a full-blown Fedora 23.

> > Jan, I gave up.
> >
> > Whatever I did, xfstests-dev/check reports a lot of failures, kernel bugs,
> > and finally either crashes the kernel or hangs.
>
> Did you run the auto group like I suggested? That's the set of tests
> that should complete successfully with minimal failures and without
> crashing the machine.

OK, I'll reserve a testing machine again.

Oleg.

Oleg Nesterov Oct. 4, 2016, 4:58 p.m. UTC | #7
On 10/04, Oleg Nesterov wrote:
>
> This time it hangs after generic/274:
>
> 	--- tests/generic/274.out	2016-10-04 04:23:24.209006171 -0400
> 	+++ /root/XFS/xfstests-dev/results//generic/274.out.bad	2016-10-04 05:17:49.291742498 -0400
> 	@@ -3,3 +3,4 @@
> 	preallocation test
> 	------------------------------
> 	done
> 	+./common/rc: line 344:  4693 Segmentation fault      exit
> 	...
> 	(Run 'diff -u tests/generic/274.out /root/XFS/xfstests-dev/results//generic/274.out.bad'  to see the entire diff)
>
> because of a kernel bug:
>
> 	[ 2730.242537] run fstests generic/274 at 2016-10-04 05:17:34
> 	[ 2730.738352] XFS (loop1): Mounting V5 Filesystem
> 	[ 2730.741451] XFS (loop1): Ending clean mount
> 	[ 2744.508698] ------------[ cut here ]------------
> 	[ 2744.509190] kernel BUG at ./include/linux/swap.h:276!

I removed this test and then the next run (after reboot) hangs at xfs/073 with
a lot of errors in dmesg like

	XFS (loop2): Failing async write on buffer block 0x9600790. Retrying async write.
	blk_update_request: I/O error, dev loop2, sector 8389920
	loop: Write error at byte offset 4295647232, length 4096.

Oleg.

Dave Chinner Oct. 4, 2016, 7:44 p.m. UTC | #8
On Tue, Oct 04, 2016 at 01:43:43PM +0200, Oleg Nesterov wrote:
> On 10/03, Oleg Nesterov wrote:
> >
> > On 10/03, Dave Chinner wrote:
> > >
> > > On Fri, Sep 30, 2016 at 07:14:34PM +0200, Oleg Nesterov wrote:
> > > > On 09/27, Oleg Nesterov wrote:
> > > > >
> > > > Jan, I gave up.
> > > >
> > > > Whatever I did, xfstests-dev/check reports a lot of failures, kernel bugs,
> > > > and finally either crashes the kernel or hangs.
> > >
> > > Did you run the auto group like I suggested? That's the set of tests
> > > that should complete successfully with minimal failures and without
> > > crashing the machine.
> >
> > OK, I'll reserve a testing machine again.
> 
> Fedora 23 running in KVM guest
> 
> kernel v4.8 clean, see .config below
> 
> xfstests from git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
> last commit 0bbd20b104765c75faaf8e28a54c32df505ce7fd ("btrfs: test
> free space tree mount options")
> 
> script:
> 
> 	dd if=/dev/zero of=TEST.img bs=1MiB count=8192
> 	dd if=/dev/zero of=SCRATCH.img bs=1MiB count=8192
> 
> 	losetup --find --show TEST.img
> 	losetup --find --show SCRATCH.img
> 
> 	mkfs.xfs -f /dev/loop0
> 	mkfs.xfs -f /dev/loop1
> 
> 	mkdir -p TEST SCRATCH
> 
> 	mount /dev/loop0 TEST
> 	mount /dev/loop1 SCRATCH
> 
> 	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH ./check -g auto

Looks sane.

> This time it hangs after generic/274:
> 
> 	--- tests/generic/274.out	2016-10-04 04:23:24.209006171 -0400
> 	+++ /root/XFS/xfstests-dev/results//generic/274.out.bad	2016-10-04 05:17:49.291742498 -0400
> 	@@ -3,3 +3,4 @@
> 	preallocation test
> 	------------------------------
> 	done
> 	+./common/rc: line 344:  4693 Segmentation fault      exit
> 	...
> 	(Run 'diff -u tests/generic/274.out /root/XFS/xfstests-dev/results//generic/274.out.bad'  to see the entire diff)
> 
> because of a kernel bug:
> 
> 	[ 2730.242537] run fstests generic/274 at 2016-10-04 05:17:34
> 	[ 2730.738352] XFS (loop1): Mounting V5 Filesystem
> 	[ 2730.741451] XFS (loop1): Ending clean mount
> 	[ 2744.508698] ------------[ cut here ]------------
> 	[ 2744.509190] kernel BUG at ./include/linux/swap.h:276!
> 
> static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
> {
> 	VM_BUG_ON(!workingset_node_shadows(node));
> 	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
> }

Linus had a massive rant about this yesterday. The regression was
introduced between v4.8-rc8 and v4.8, and it took Linus's machine
down. xfstests found a filesystem-related kernel bug, as it's
supposed to.

> plus the following warnings:
> 
> 	[ 1894.500040] run fstests generic/070 at 2016-10-04 05:03:39
> 	[ 1895.076655] =================================
> 	[ 1895.077136] [ INFO: inconsistent lock state ]
> 	[ 1895.077574] 4.8.0 #1 Not tainted
> 	[ 1895.077900] ---------------------------------
> 	[ 1895.078330] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
> 	[ 1895.078993] fsstress/18239 [HC0[0]:SC0[0]:HE1:SE1] takes:
> 	[ 1895.079522]  (&xfs_nondir_ilock_class){++++?-}, at: [<ffffffffc049ad45>] xfs_ilock+0x165/0x210 [xfs]
> 	[ 1895.080529] {IN-RECLAIM_FS-W} state was registered at:

[snip kswapd doing XFS inode reclaim]

> 	[ 1895.102565] CPU: 0 PID: 18239 Comm: fsstress Not tainted 4.8.0 #1
> 	[ 1895.103160] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> 	[ 1895.103721]  0000000000000086 000000008f20d9c8 ffff880038b73818 ffffffffb6449133
> 	[ 1895.104486]  ffff880038a5cd00 ffffffffb78bcd70 ffff880038b73868 ffffffffb6106c75
> 	[ 1895.105257]  0000000000000000 0000000000000001 ffff880000000001 0000000000000008
> 	[ 1895.106034] Call Trace:
> 	[ 1895.106281]  [<ffffffffb6449133>] dump_stack+0x85/0xc2
> 	[ 1895.106789]  [<ffffffffb6106c75>] print_usage_bug+0x215/0x240
> 	[ 1895.107352]  [<ffffffffb610722b>] mark_lock+0x58b/0x650
> 	[ 1895.107888]  [<ffffffffb61061e0>] ? check_usage_forwards+0x160/0x160
> 	[ 1895.108506]  [<ffffffffb610735f>] mark_held_locks+0x6f/0xa0
> 	[ 1895.109085]  [<ffffffffb610a5ea>] lockdep_trace_alloc+0xca/0x110
> 	[ 1895.109695]  [<ffffffffb6260b89>] kmem_cache_alloc_node_trace+0x39/0x2a0
> 	[ 1895.110414]  [<ffffffffb623e2ce>] ? vm_map_ram+0x2de/0x490
> 	[ 1895.110995]  [<ffffffffb623e2ce>] vm_map_ram+0x2de/0x490
> 	[ 1895.111536]  [<ffffffffb623e03b>] ? vm_map_ram+0x4b/0x490
> 	[ 1895.112142]  [<ffffffffc04823bc>] _xfs_buf_map_pages+0x6c/0x110 [xfs]
> 	[ 1895.112837]  [<ffffffffc04847f7>] xfs_buf_get_map+0x2c7/0x470 [xfs]
> 	[ 1895.113500]  [<ffffffffc043f5b9>] xfs_attr_rmtval_set+0x2c9/0x450 [xfs]
> 	[ 1895.114235]  [<ffffffffc0438019>] xfs_attr_node_addname+0x4c9/0x5d0 [xfs]
> 	[ 1895.114948]  [<ffffffffc043917c>] xfs_attr_set+0x3dc/0x460 [xfs]
> 	[ 1895.115593]  [<ffffffffc04a8d4f>] xfs_xattr_set+0x4f/0x90 [xfs]
> 	[ 1895.116221]  [<ffffffffb62bb2d0>] generic_setxattr+0x70/0x80
> 	[ 1895.116798]  [<ffffffffb62bba9f>] __vfs_setxattr_noperm+0xaf/0x1a0
> 	[ 1895.117445]  [<ffffffffb63bc525>] ? security_inode_setxattr+0x65/0xb0
> 	[ 1895.118121]  [<ffffffffb62bbc37>] vfs_setxattr+0xa7/0xb0
> 	[ 1895.118651]  [<ffffffffb62bbd69>] setxattr+0x129/0x190

And that is a bug in the lockdep annotations for memory allocation because it
fails to take into account the current task flags that are set via
memalloc_noio_save() to prevent vmalloc from doing GFP_KERNEL allocations. i.e.
in _xfs_buf_map_pages():

                /*
                 * vm_map_ram() will allocate auxillary structures (e.g.
                 * pagetables) with GFP_KERNEL, yet we are likely to be under
                 * GFP_NOFS context here. Hence we need to tell memory reclaim
                 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
                 * memory reclaim re-entering the filesystem here and
                 * potentially deadlocking.
                 */
                noio_flag = memalloc_noio_save();
                do {
                        bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
                                                -1, PAGE_KERNEL);
                        if (bp->b_addr)
                                break;
                        vm_unmap_aliases();
                } while (retried++ <= 1);
                memalloc_noio_restore(noio_flag);

So: lockdep annotation bug.
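
For reference, memalloc_noio_save()/memalloc_noio_restore() just toggle
a task flag - roughly this, going from the sched.h helpers of this era,
so treat it as a sketch:

	static inline unsigned int memalloc_noio_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
		/* page allocator strips __GFP_IO|__GFP_FS for this task */
		current->flags |= PF_MEMALLOC_NOIO;
		return flags;
	}

	static inline void memalloc_noio_restore(unsigned int flags)
	{
		current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
	}

lockdep's allocation hook only sees the gfp_mask it is handed, so it
never notices that PF_MEMALLOC_NOIO has already downgraded the
allocation.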

> 	[ 2018.537291] run fstests generic/095 at 2016-10-04 05:05:43
> 	[ 2018.733488] XFS (loop1): Unmounting Filesystem
> 	[ 2019.341189] XFS (loop1): Mounting V5 Filesystem
> 	[ 2019.344046] XFS (loop1): Ending clean mount
> 	[ 2019.796930] ------------[ cut here ]------------
> 	[ 2019.797529] WARNING: CPU: 0 PID: 10347 at fs/xfs/xfs_file.c:311 xfs_file_dio_aio_read+0x2d2/0x370 [xfs]
[snip]
> 	[ 2019.814146] WARNING: CPU: 0 PID: 10363 at fs/xfs/xfs_file.c:665 xfs_file_dio_aio_write+0x391/0x460 [xfs]
[snip]

And generic/095 is expected to generate those warnings; the test filters
them out (the test harness checks dmesg for unexpected
failures). I explained the reason for these warnings yesterday.

In summary: xfstests found a real bug and a lockdep annotation
annoyance, and exercised a path that causes data corruption and
should trigger warnings. Looks to me like xfstests is doing exactly
what it is supposed to be doing....

Cheers,

Dave.
Dave Chinner Oct. 4, 2016, 8:03 p.m. UTC | #9
On Tue, Oct 04, 2016 at 06:58:27PM +0200, Oleg Nesterov wrote:
> I removed this test and then the next run (after reboot) hangs at xfs/073 with
> a lot of errors in dmesg like
> 
> 	XFS (loop2): Failing async write on buffer block 0x9600790. Retrying async write.
> 	blk_update_request: I/O error, dev loop2, sector 8389920
> 	loop: Write error at byte offset 4295647232, length 4096.

Tests will dump lots of errors into dmesg. That doesn't mean there's
a problem - many tests are designed to exercise error paths. That,
however, looks like a loop device problem.

xfs/073 is using loop devices internally itself, so this ends up
with XFS on loop2 on XFS on loop1. That loop2 device is 100GB in
size, and the test copies the xfstests source tree to a loopback
filesystem image in $SCRATCH_DEV mounted on $TEST_DIR/$$.

The issue is, most likely, that your TEST_DIR and SCRATCH_MNT are
rooted in the xfstests source tree. Hence copying the xfstests
source tree will also try to copy the 8GB image files into the
filesystems that the image files contain. Which, clearly, will
eventually result in ENOSPC errors when writing to the underlying
loop device...

Put your TEST_DIR and SCRATCH_MNT mount points outside the xfstests
directory, and this should go away. Most people use /mnt/test and
/mnt/scratch for these....
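
E.g. something like this (untested), adapting your earlier script:

	mkdir -p /mnt/test /mnt/scratch
	mount /dev/loop0 /mnt/test
	mount /dev/loop1 /mnt/scratch

	TEST_DEV=/dev/loop0 TEST_DIR=/mnt/test \
	SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=/mnt/scratch ./check -g auto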

Cheers,

Dave.
Oleg Nesterov Oct. 5, 2016, 4:33 p.m. UTC | #10
On 10/05, Dave Chinner wrote:
>
> On Tue, Oct 04, 2016 at 06:58:27PM +0200, Oleg Nesterov wrote:
> > I removed this test and then the next run (after reboot) hangs at xfs/073 with
> > a lot of errors in dmesg like
> >
> > 	XFS (loop2): Failing async write on buffer block 0x9600790. Retrying async write.
> > 	blk_update_request: I/O error, dev loop2, sector 8389920
> > 	loop: Write error at byte offset 4295647232, length 4096.
>
> tests will dump lots of errors into dmesg. That doesn't mean there's
> a problem - many tests are designed to exercise error paths. That,
> however, looks like a loop device problem.
>
> xfs/073 is using loop devices internally itself, so this ends up
> with XFS on loop2 on XFS on loop1. That loop2 device is 100GB in
> size, and the test copies the xfstests source tree to a loopback
> filesystem image in $SCRATCH_DEV mounted on $TEST_DIR/$$.
>
> The issue is, most likely, that your TEST_DIR and SCRATCH_MNT are
> rooted in the xfstests source tree. Hence copying the xfstests
> source tree will also try to copy the 8GB image files into the
> filesystems that the image files contain. Which, clearly, will
> eventually result in ENOSPC errors when writing to the underlying
> loop device...
>
> Put your TEST_DIR and SCRATCH_MNT mount points outside the xfstests
> directory, and this should go away.

Yes, thanks, xfs/073 no longer hangs.

Oleg.

Oleg Nesterov Oct. 5, 2016, 4:44 p.m. UTC | #11
On 10/05, Dave Chinner wrote:
>
> On Tue, Oct 04, 2016 at 01:43:43PM +0200, Oleg Nesterov wrote:
>
> > plus the following warnings:
> >
> > 	[ 1894.500040] run fstests generic/070 at 2016-10-04 05:03:39
> > 	[ 1895.076655] =================================
> > 	[ 1895.077136] [ INFO: inconsistent lock state ]
> > 	[ 1895.077574] 4.8.0 #1 Not tainted
> > 	[ 1895.077900] ---------------------------------
> > 	[ 1895.078330] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
> > 	[ 1895.078993] fsstress/18239 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > 	[ 1895.079522]  (&xfs_nondir_ilock_class){++++?-}, at: [<ffffffffc049ad45>] xfs_ilock+0x165/0x210 [xfs]
> > 	[ 1895.080529] {IN-RECLAIM_FS-W} state was registered at:
>
> And that is a bug in the lockdep annotations for memory allocation because it
> fails to take into account the current task flags that are set via
> memalloc_noio_save() to prevent vmalloc from doing GFP_KERNEL allocations. i.e.
> in _xfs_buf_map_pages():

OK, I see...

I'll re-test with the following change:

	--- a/kernel/locking/lockdep.c
	+++ b/kernel/locking/lockdep.c
	@@ -2867,7 +2867,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
			return;
	 
		/* We're only interested __GFP_FS allocations for now */
	-       if (!(gfp_mask & __GFP_FS))
	+       if ((curr->flags & PF_MEMALLOC_NOIO) || !(gfp_mask & __GFP_FS))
			return;


Hmm. This is off-topic and most probably I missed something... but at
first glance we can simplify/improve the reclaim-fs lockdep annotations:

1. add the global "struct lockdep_map reclaim_fs_map"

2. change __lockdep_trace_alloc

	-	mark_held_locks(curr, RECLAIM_FS);
	+	lock_map_acquire(&reclaim_fs_map);
	+	lock_map_release(&reclaim_fs_map);

3. turn lockdep_set/clear_current_reclaim_state() into

	void lockdep_set_current_reclaim_state(gfp_t gfp_mask)
	{
		if (gfp_mask & __GFP_FS)
			lock_map_acquire(&reclaim_fs_map);
	}

	void lockdep_clear_current_reclaim_state(gfp_t gfp_mask)
	{
		if (gfp_mask & __GFP_FS)
			lock_map_release(&reclaim_fs_map);
	}

and now we can remove task_struct->lockdep_reclaim_gfp and all other
RECLAIM_FS hacks in lockdep.c. Plus we can easily extend this logic to
check more GFP_ flags.
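
where, for step 1, the global map would be something like this (completely
untested, of course):

	static struct lockdep_map reclaim_fs_map =
		STATIC_LOCKDEP_MAP_INIT("reclaim_fs", &reclaim_fs_map);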

No?

Oleg.

Jan Kara Oct. 6, 2016, 7:27 a.m. UTC | #12
On Wed 05-10-16 18:44:32, Oleg Nesterov wrote:
> On 10/05, Dave Chinner wrote:
> >
> > On Tue, Oct 04, 2016 at 01:43:43PM +0200, Oleg Nesterov wrote:
> >
> > > plus the following warnings:
> > >
> > > 	[ 1894.500040] run fstests generic/070 at 2016-10-04 05:03:39
> > > 	[ 1895.076655] =================================
> > > 	[ 1895.077136] [ INFO: inconsistent lock state ]
> > > 	[ 1895.077574] 4.8.0 #1 Not tainted
> > > 	[ 1895.077900] ---------------------------------
> > > 	[ 1895.078330] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
> > > 	[ 1895.078993] fsstress/18239 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > > 	[ 1895.079522]  (&xfs_nondir_ilock_class){++++?-}, at: [<ffffffffc049ad45>] xfs_ilock+0x165/0x210 [xfs]
> > > 	[ 1895.080529] {IN-RECLAIM_FS-W} state was registered at:
> >
> > And that is a bug in the lockdep annotations for memory allocation because it
> > fails to take into account the current task flags that are set via
> > memalloc_noio_save() to prevent vmalloc from doing GFP_KERNEL allocations. i.e.
> > in _xfs_buf_map_pages():
> 
> OK, I see...
> 
> I'll re-test with the following change:
> 
> 	--- a/kernel/locking/lockdep.c
> 	+++ b/kernel/locking/lockdep.c
> 	@@ -2867,7 +2867,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> 			return;
> 	 
> 		/* We're only interested __GFP_FS allocations for now */
> 	-       if (!(gfp_mask & __GFP_FS))
> 	+       if ((curr->flags & PF_MEMALLOC_NOIO) || !(gfp_mask & __GFP_FS))
> 			return;
> 
> 
> Hmm. This is off-topic and most probably I missed something... but at
> first glance we can simplify/improve the reclaim-fs lockdep annotations:
> 
> 1. add the global "struct lockdep_map reclaim_fs_map"
> 
> 2. change __lockdep_trace_alloc
> 
> 	-	mark_held_locks(curr, RECLAIM_FS);
> 	+	lock_map_acquire(&reclaim_fs_map);
> 	+	lock_map_release(&reclaim_fs_map);
> 
> 3. turn lockdep_set/clear_current_reclaim_state() into
> 
> 	void lockdep_set_current_reclaim_state(gfp_t gfp_mask)
> 	{
> 		if (gfp_mask & __GFP_FS)
> 			lock_map_acquire(&reclaim_fs_map);
> 	}
> 
> 	void lockdep_clear_current_reclaim_state(gfp_t gfp_mask)
> 	{
> 		if (gfp_mask & __GFP_FS)
> 			lock_map_release(&reclaim_fs_map);
> 	}
> 
> and now we can remove task_struct->lockdep_reclaim_gfp and all other
> RECLAIM_FS hacks in lockdep.c. Plus we can easily extend this logic to
> check more GFP_ flags.

Yeah, looks possible to me. I've added Peter to CC since he's most likely
to know.

								Honza
Johannes Weiner Oct. 6, 2016, 1:44 p.m. UTC | #13
On Tue, Oct 04, 2016 at 01:48:00PM +0200, Michal Hocko wrote:
> Johannes is already looking into this
> http://lkml.kernel.org/r/20161004093216.GA21170@cmpxchg.org
> 
> On Tue 04-10-16 13:43:43, Oleg Nesterov wrote:
> > because of a kernel bug:
> > 
> > 	[ 2730.242537] run fstests generic/274 at 2016-10-04 05:17:34
> > 	[ 2730.738352] XFS (loop1): Mounting V5 Filesystem
> > 	[ 2730.741451] XFS (loop1): Ending clean mount
> > 	[ 2744.508698] ------------[ cut here ]------------
> > 	[ 2744.509190] kernel BUG at ./include/linux/swap.h:276!
> > 
> > static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
> > {
> > 	VM_BUG_ON(!workingset_node_shadows(node));
> > 	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;

We tracked this one down and Linus merged a fix for this issue:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d3798ae8c6f3767c726403c2ca6ecc317752c9dd

Let me know if this still fires on kernels with that commit.
Thanks!
Oleg Nesterov Oct. 7, 2016, 4:52 p.m. UTC | #14
On 10/06, Johannes Weiner wrote:
>
> On Tue, Oct 04, 2016 at 01:48:00PM +0200, Michal Hocko wrote:
> > Johannes is already looking into this
> > http://lkml.kernel.org/r/20161004093216.GA21170@cmpxchg.org
> >
> > On Tue 04-10-16 13:43:43, Oleg Nesterov wrote:
> > > because of a kernel bug:
> > >
> > > 	[ 2730.242537] run fstests generic/274 at 2016-10-04 05:17:34
> > > 	[ 2730.738352] XFS (loop1): Mounting V5 Filesystem
> > > 	[ 2730.741451] XFS (loop1): Ending clean mount
> > > 	[ 2744.508698] ------------[ cut here ]------------
> > > 	[ 2744.509190] kernel BUG at ./include/linux/swap.h:276!
> > >
> > > static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
> > > {
> > > 	VM_BUG_ON(!workingset_node_shadows(node));
> > > 	node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
>
> We tracked this one down and Linus merged a fix for this issue:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d3798ae8c6f3767c726403c2ca6ecc317752c9dd

Confirmed, generic/274 no longer crashes the kernel.

Thanks.

Oleg.

Patch

diff --git a/fs/super.c b/fs/super.c
index 2549896c..0447afe 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1214,25 +1214,34 @@  EXPORT_SYMBOL(__sb_start_write);
 static void sb_wait_write(struct super_block *sb, int level)
 {
 	percpu_down_write(sb->s_writers.rw_sem + level-1);
-	/*
-	 * We are going to return to userspace and forget about this lock, the
-	 * ownership goes to the caller of thaw_super() which does unlock.
-	 *
-	 * FIXME: we should do this before return from freeze_super() after we
-	 * called sync_filesystem(sb) and s_op->freeze_fs(sb), and thaw_super()
-	 * should re-acquire these locks before s_op->unfreeze_fs(sb). However
-	 * this leads to lockdep false-positives, so currently we do the early
-	 * release right after acquire.
-	 */
-	percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
 }
 
-static void sb_freeze_unlock(struct super_block *sb)
+/*
+ * We are going to return to userspace and forget about these locks, the
+ * ownership goes to the caller of thaw_super() which does unlock().
+ */
+static void lockdep_sb_freeze_release(struct super_block *sb)
+{
+	int level;
+
+	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
+		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
+}
+
+/*
+ * Tell lockdep we are holding these locks before we call ->unfreeze_fs(sb).
+ */
+static void lockdep_sb_freeze_acquire(struct super_block *sb)
 {
 	int level;
 
 	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
 		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
+}
+
+static void sb_freeze_unlock(struct super_block *sb)
+{
+	int level;
 
 	for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
 		percpu_up_write(sb->s_writers.rw_sem + level);
@@ -1328,6 +1337,7 @@  int freeze_super(struct super_block *sb)
 	 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+	lockdep_sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }
@@ -1354,11 +1364,14 @@  int thaw_super(struct super_block *sb)
 		goto out;
 	}
 
+	lockdep_sb_freeze_acquire(sb);
+
 	if (sb->s_op->unfreeze_fs) {
 		error = sb->s_op->unfreeze_fs(sb);
 		if (error) {
 			printk(KERN_ERR
 				"VFS:Filesystem thaw failed\n");
+			lockdep_sb_freeze_release(sb);
 			up_write(&sb->s_umount);
 			return error;
 		}
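
For context, percpu_rwsem_release() and percpu_rwsem_acquire() used above
are thin wrappers around the lockdep core; roughly this, modulo the rwsem
owner bookkeeping, going from the include/linux/percpu-rwsem.h of this era:

	static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
						bool read, unsigned long ip)
	{
		/* drop lockdep's notion of ownership without unlocking */
		lock_release(&sem->rw_sem.dep_map, 1, ip);
	}

	static inline void percpu_rwsem_acquire(struct percpu_rw_semaphore *sem,
						bool read, unsigned long ip)
	{
		/* re-assert ownership for the current task; no real locking */
		lock_acquire(&sem->rw_sem.dep_map, 0, 1, read, 1, NULL, ip);
	}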