Date: Tue, 11 Aug 2015 15:16:26 +0200
From: Oleg Nesterov <oleg@redhat.com>
To: Dave Chinner
Cc: Jan Kara, Al Viro, Dave Hansen, "Paul E. McKenney", Peter Zijlstra,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/4] change sb_writers to use percpu_rw_semaphore
Message-ID: <20150811131626.GA19780@redhat.com>
References: <20150722211513.GA19986@redhat.com>
	<20150807195552.GB28529@redhat.com>
	<20150810145942.GF3768@quack.suse.cz>
	<20150810224154.GK3902@dastard>
In-Reply-To: <20150810224154.GK3902@dastard>

On 08/11, Dave Chinner wrote:
>
> On Mon, Aug 10, 2015 at 04:59:42PM +0200, Jan Kara wrote:
> >
> > One would like to construct the lock chain as:
> >
> >   CPU0 (chown foo dir)        CPU1 (readdir dir)        CPU2 (page fault)
> >   process Y                   process X, thread 0       process X, thread 1
> >
> >                               get ILOCK for dir
> >   gets freeze protection
> >   starts transaction in
> >     xfs_setattr_nonsize
> >   waits to get ILOCK on 'dir'
> >                                                         get mmap_sem for X
> >                               wait for mmap_sem for
> >                                 process X in filldir()
> >                                                         wait for freeze protection
> >                                                           in xfs_page_mkwrite
> >
> > and CPU3 then being in freeze_super() blocking CPU2 and waiting for CPU0 to
> > finish its freeze-protected section. But this cannot happen. The reason is
> > that we block writers level-by-level and thus while there are writers at
> > level X, we do not block writers at level X+1.
> > So in this particular case freeze_super() will block waiting for CPU0 to
> > finish its freeze-protected section while CPU2 is free to continue.
> >
> > In general we have a chain like
> >
> >   freeze L0 -> freeze L1 -> freeze L2 -> ILOCK -> mmap_sem --\
> >      ^                                                       |
> >      \-------------------------------------------------------/
> >
> > But since ILOCK is always acquired with freeze protection at L0 and we can
> > block at L1 only after there are no writers at L0, this loop can never
> > happen.
> >
> > Note that if we use the property of freezing that a lock at level X+1 cannot
> > block when we hold a lock at level X, we can as well simplify the dependency
> > graph and track in it only the lowest level of freeze lock that is
> > currently acquired (since the levels above it cannot block and do not in
> > any way influence blocking of other processes either and thus are
> > irrelevant for the purpose of deadlock detection). Then the dependency
> > graph we'd get would be:
> >
> >   freeze L0 -> ILOCK -> mmap_sem -> freeze L1
> >
> > and we have a nice acyclic graph we like to see... So probably we have to
> > hack the lockdep instrumentation some more and just not tell lockdep
> > about freeze locks at higher levels if we already hold a lock at a lower
> > level. Thoughts?
>
> The XFS directory ilock->filldir->might_fault locking path has been
> generating false positives in quite a lot of places because of
> things we do on one side of the mmap_sem in filesystem paths vs
> things we do on the other side of the mmap_sem in the page fault
> path.

OK. Dave, Jan, thanks a lot.

I was also confused because I didn't know that the "Chain exists of" part of
print_circular_bug() only prints a _partial_ chain, and I have to admit that
I do not even understand which part it actually shows...

I'll drop

	move rwsem_release() from sb_wait_write() to freeze_super()
	change thaw_super() to re-acquire s_writers.lock_map

from the previous series and resend everything. Let's change sb_writers to
use percpu_rw_semaphore first, then try to improve the lockdep annotations.

See the interdiff below. With this change I have

	TEST_DEV=/dev/loop0 TEST_DIR=TEST SCRATCH_DEV=/dev/loop1 SCRATCH_MNT=SCRATCH \
		./check `grep -il freeze tests/*/???`
	...
	Ran: generic/068 generic/085 generic/280 generic/311 xfs/011 xfs/119 xfs/297
	Passed all 7 tests

anything else I should test?

Oleg.

The percpu_rwsem_release() added to sb_wait_write() below needs a comment to
explain that this is not what we want.
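
Something along these lines, perhaps: the function as it reads with the
interdiff below applied, plus a suggested comment that mostly recycles the
explanation carried by the old sb_freeze_release(); the wording is only a
suggestion:

	static void sb_wait_write(struct super_block *sb, int level)
	{
		percpu_down_write(sb->s_writers.rw_sem + level-1);
		/*
		 * The rwsem stays write-locked until thaw_super(), which may
		 * run in a different task.  Drop the lockdep annotation here
		 * only to avoid the warning from lockdep_sys_exit() when
		 * freeze_super() returns to user space with the locks still
		 * held.  This is not what we want; the plan is to improve the
		 * lockdep annotations later.
		 */
		percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
	}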
--- a/fs/super.c
+++ b/fs/super.c
@@ -1215,27 +1215,15 @@ EXPORT_SYMBOL(__sb_start_write);
 
 static void sb_wait_write(struct super_block *sb, int level)
 {
 	percpu_down_write(sb->s_writers.rw_sem + level-1);
+	percpu_rwsem_release(sb->s_writers.rw_sem + level-1, 0, _THIS_IP_);
 }
 
-static void sb_freeze_release(struct super_block *sb)
-{
-	int level;
-	/* Avoid the warning from lockdep_sys_exit() */
-	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
-		percpu_rwsem_release(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
-}
-
-static void sb_freeze_acquire(struct super_block *sb)
+static void sb_freeze_unlock(struct super_block *sb)
 {
 	int level;
 
 	for (level = 0; level < SB_FREEZE_LEVELS; ++level)
 		percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0, _THIS_IP_);
-}
-
-static void sb_freeze_unlock(struct super_block *sb)
-{
-	int level;
 
 	for (level = SB_FREEZE_LEVELS; --level >= 0; )
 		percpu_up_write(sb->s_writers.rw_sem + level);
@@ -1331,7 +1319,6 @@ int freeze_super(struct super_block *sb)
 	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
 	 */
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
-	sb_freeze_release(sb);
 	up_write(&sb->s_umount);
 	return 0;
 }
@@ -1358,14 +1345,11 @@ int thaw_super(struct super_block *sb)
 		goto out;
 	}
 
-	sb_freeze_acquire(sb);
-
 	if (sb->s_op->unfreeze_fs) {
 		error = sb->s_op->unfreeze_fs(sb);
 		if (error) {
 			printk(KERN_ERR "VFS:Filesystem thaw failed\n");
-			sb_freeze_release(sb);
 			up_write(&sb->s_umount);
 			return error;
 		}
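
For context, the "block writers level-by-level" behaviour Jan describes above
comes from freeze_super() taking the three levels in turn. Roughly, as a
simplified sketch (the real code in fs/super.c also handles the error paths,
sync_filesystem() and the ->freeze_fs() callback; the _sketch name is mine):

	static int freeze_super_sketch(struct super_block *sb)
	{
		/* Stop new SB_FREEZE_WRITE writers, wait for the current ones. */
		sb->s_writers.frozen = SB_FREEZE_WRITE;
		sb_wait_write(sb, SB_FREEZE_WRITE);

		/*
		 * While we are still waiting above for SB_FREEZE_WRITE holders,
		 * page faults are not blocked yet; this is what breaks the
		 * would-be cycle in the scenario quoted at the top.
		 */
		sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
		sb_wait_write(sb, SB_FREEZE_PAGEFAULT);

		/* Finally quiesce the internal filesystem writers. */
		sb->s_writers.frozen = SB_FREEZE_FS;
		sb_wait_write(sb, SB_FREEZE_FS);

		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
		return 0;
	}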