From patchwork Thu Mar  9 21:38:00 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: NeilBrown <neilb@suse.com>
X-Patchwork-Id: 9614217
Return-Path: <linux-nfs-owner@kernel.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	ADE51604D9 for <patchwork-linux-nfs@patchwork.kernel.org>;
	Thu,  9 Mar 2017 21:39:00 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A2222286C0
	for <patchwork-linux-nfs@patchwork.kernel.org>;
	Thu,  9 Mar 2017 21:39:00 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 968BB286CB; Thu,  9 Mar 2017 21:39:00 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00, RCVD_IN_DNSWL_HI,
	T_TVD_MIME_EPI autolearn=ham version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F37DC286C0
	for <patchwork-linux-nfs@patchwork.kernel.org>;
	Thu,  9 Mar 2017 21:38:59 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751405AbdCIVi6 (ORCPT
	<rfc822;patchwork-linux-nfs@patchwork.kernel.org>);
	Thu, 9 Mar 2017 16:38:58 -0500
Received: from mx2.suse.de ([195.135.220.15]:54041 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750803AbdCIVi4 (ORCPT <rfc822;linux-nfs@vger.kernel.org>);
	Thu, 9 Mar 2017 16:38:56 -0500
X-Virus-Scanned: by amavisd-new at test-mx.suse.de
Received: from relay1.suse.de (charybdis-ext.suse.de [195.135.220.254])
	by mx2.suse.de (Postfix) with ESMTP id C33C7AAB9;
	Thu,  9 Mar 2017 21:38:07 +0000 (UTC)
From: NeilBrown <neilb@suse.com>
To: Trond Myklebust <trondmy@primarydata.com>,
	"linux-nfs\@vger.kernel.org" <linux-nfs@vger.kernel.org>
Date: Fri, 10 Mar 2017 08:38:00 +1100
Subject: Re: A NFS mount can still write to the server after 'umount' has
	completed.
In-Reply-To: <1489065706.3121.1.camel@primarydata.com>
References: <871su7i0iw.fsf@notabene.neil.brown.name>
	<87y3wfgipy.fsf@notabene.neil.brown.name>
	<1489065706.3121.1.camel@primarydata.com>
Message-ID: <87pohqgmh3.fsf@notabene.neil.brown.name>
MIME-Version: 1.0
Sender: linux-nfs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-nfs.vger.kernel.org>
X-Mailing-List: linux-nfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

On Thu, Mar 09 2017, Trond Myklebust wrote:

> On Thu, 2017-03-09 at 15:46 +1100, NeilBrown wrote:
>> On Thu, Mar 09 2017, NeilBrown wrote:
>> 
>> > I've been chasing down a problem where a customer has a localhost
>> > mount,
>> > and the sequence
>> >   unmount -at nfs,nfs4
>> >   stop nfsserver
>> >   sync
>> > 
>> > hangs on the sync.  The 'sync' is trying to write to the NFS
>> > filesystem
>> > that has just been unmounted.
>
> So why can't you move the sync up one line?
>
>> > I have duplicated the problem on a current mainline kernel.
>> > 
>> > There are two important facts that lead to the explanation of this.
>> > 1/ whenever a 'struct file' is open, an s_active reference is held
>> > on
>> >    the superblock, via "open_context" calling nfs_sb_active().
>> >    This doesn't prevent "unmount" from succeeding (i.e. EBUSY isn't
>> >    returned), but does prevent the actual unmount from happening
>> >    (->kill_sb() isn't called).
>> > 2/ When a memory mapping of a file is torn down, the file is
>> >    "released", causing the context to be discarded and the
>> > sb_active
>> >    reference released, but unlike close(2), file_operations-
>> > >flush()
>> >    is not called.
>> 
>> I realised that there is another piece of the puzzle.
>> When a page is marked dirty (e.g. nfs_vm_page_mkwrite),
>>  nfs_updatepage() -> nfs_writepage_setup()
>> creates a 'struct nfs_page' request which holds a reference to
>> the open context, which in turn holds an active reference to the
>> superblock.
>> So as long as there are dirty pages, the superblock will not go
>> inactive.  All the rest still holds.
>> 
>> NeilBrown
>> 
>> 
>> > 
>> > Consequently, if you:
>> >  open an NFS file
>> >  mmap some pages PROT_WRITE
>> >  close the file
>> >  modify the pages
>> >  unmap the pages
>> >  unmount the filesystem
>> > 
>> > the filesystem will remain active, and the pages will remain dirty.
>> > If you then make the nfs server unavailable - e.g. stop it, or tear
>> > down
>> > the network connection - and then call 'sync', the sync will hang.
>> > 
>> > This is surprising, at the least :-)
>> > 
>> > I have two ideas how it might be fixed.
>> > 
>> > One is to call nfs_file_flush() from within nfs_file_release().
>> > This is probably simplest (and appears to work).
>> > 
>> > The other is to add a ".close" to nfs_file_vm_ops.  This could
>> > trigger a
>> > (partial) flush whenever a page is unmapped.  As closing an NFS
>> > file
>> > always triggers a flush, it seems reasonable that unmapping a page
>> > would trigger a flush of that page.
>> > 
>> > Thoughts?
>> > 
>
> All this is working as expected. The only way to ensure that data is
> synced to disk by mmap() is to explicitly call msync(). This is well
> documented, and should be well understood by application developers. If
> those developers are ignoring the documentation, then I suggest votoing
> with your feet and getting your software from someone else. 

Hi Trond,
 I don't understand how you can see this behaviour as acceptable.
 For any other filesystem, unmount is a synchronisation point.
 Assuming there are not bind mount or similar, and that MNT_DETACH is
 not used, an unmount will flush out any changes.  It doesn't matter whether
 or not the application has called fsync or msync - the flush still
 happens on unmount.
 So with any other filesystem you can unmount and then remove the media
 and be sure that nothing is consistent and that no further writes will
 be attempted.
 With NFS as it currently is, unmount does not act as a synchronisation
 point, and writes can continue after the unmount.  How is that
 acceptable?

 The patch below explains the problem and provides a simple fix.

Thanks,
NeilBrown

From: NeilBrown <neilb@suse.com>
Date: Fri, 10 Mar 2017 08:19:32 +1100
Subject: [PATCH] NFS: flush out dirty data on final fput().

Any dirty NFS page holds an s_active reference on the superblock,
because page_private() references an nfs_page, which references an
open context, which references the superblock.

So if there are any dirty pages when the filesystem is unmounted, the
unmount will act like a "lazy" unmount and not call ->kill_sb().
Background write-back can then write out the pages *after* the filesystem
unmount has apparently completed.

This contrasts with other filesystems which do not hold extra s_active
references, so ->kill_sb() is reliably called on unmount, and
generic_shutdown_super() will call sync_filesystem() to flush
everything out before the unmount completes.

When open/write/close is used to modify files, the final close causes
f_op->flush to be called, which flushes all dirty pages.  However if
open/mmap/close/modify-memory/unmap is used, dirty pages can remain in
memory after the application has dropped all references to the file.

Fix this by calling vfs_fsync() in nfs_file_release (aka
f_op->release()).  This means that on the final unmap of a file, all
changes are flushed, and ensures that when unmount is requested there
will be no dirty pages to delay the final unmount.

Without this patch, it is not safe to stop or disconnect the NFS
server after all clients have unmounted.  They need to unmount and
call "sync".

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/nfs/file.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 668213984d68..d20cf9b3019b 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct file *filp)
 {
 	dprintk("NFS: release(%pD2)\n", filp);
 
+	if (filp->f_mode & FMODE_WRITE)
+		/* Ensure dirty mmapped pages are flushed
+		 * so there will be no dirty pages to
+		 * prevent an unmount from completing.
+		 */
+		vfs_fsync(filp, 0);
 	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
 	nfs_file_clear_open_context(filp);
 	return 0;