
[4/8] NFS: flush out dirty data on file fput().

Message ID 150304037195.30218.15740287358704674869.stgit@noble (mailing list archive)
State New, archived

Commit Message

NeilBrown Aug. 18, 2017, 7:12 a.m. UTC
Any dirty NFS page holds an s_active reference on the superblock,
because page_private() references an nfs_page, which references an
open context, which references the superblock.

So if there are any dirty pages when the filesystem is unmounted, the
unmount will act like a "lazy" unmount and not call ->kill_sb().
Background write-back can then write out the pages *after* the filesystem
unmount has apparently completed.

This contrasts with other filesystems which do not hold extra s_active
references, so ->kill_sb() is reliably called on unmount, and
generic_shutdown_super() will call sync_filesystem() to flush
everything out before the unmount completes.

When open/write/close is used to modify files, the final close causes
f_op->flush to be called, which flushes all dirty pages.  However if
open/mmap/close/modify-memory/unmap is used, dirty pages can remain in
memory after the application has dropped all references to the file.
Similarly if a loop-mount is done with a NFS file, there is no final
flush when the loop device is destroyed and the file can have dirty
data after the last "close".

Also, a loop-back mount of a device does not "close" the file when the
loop device is destroyed.  This can leave dirty page cache pages too.
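
As an illustration of the mmap case, here is a minimal userspace sketch (the
path, size and data are arbitrary): the page is dirtied only after close(),
so the last chance to flush it is when munmap() drops the final reference to
the file.

/* dirty-mmap.c - minimal sketch only; error checking omitted and the
 * path is just an example of a file on an NFS mount. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);
	char *p;

	ftruncate(fd, 4096);
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);                   /* f_op->flush runs here, but nothing is dirty yet */
	memcpy(p, "dirty data", 10); /* dirty the page through the mapping */
	munmap(p, 4096);             /* last reference: f_op->release, no flush without this patch */
	return 0;                    /* the dirty page can now outlive the application */
}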

Fix this by calling vfs_fsync() in nfs_file_release (aka
f_op->release()).  This means that on the final unmap of a file (or
destruction of a loop device), all changes are flushed, and ensures that
when unmount is requested there will be no dirty pages to delay the
final unmount.

Without this patch, it is not safe to stop or disconnect the NFS
server after all clients have unmounted.  They need to unmount and
call "sync".

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/nfs/file.c |    6 ++++++
 1 file changed, 6 insertions(+)




Comments

Trond Myklebust Aug. 18, 2017, 7:15 p.m. UTC | #1
On Fri, 2017-08-18 at 17:12 +1000, NeilBrown wrote:
> Any dirty NFS page holds an s_active reference on the superblock,
> because page_private() references an nfs_page, which references an
> open context, which references the superblock.
> 
> So if there are any dirty pages when the filesystem is unmounted, the
> unmount will act like a "lazy" unmount and not call ->kill_sb().
> Background write-back can then write out the pages *after* the filesystem
> unmount has apparently completed.
> 
> This contrasts with other filesystems which do not hold extra s_active
> references, so ->kill_sb() is reliably called on unmount, and
> generic_shutdown_super() will call sync_filesystem() to flush
> everything out before the unmount completes.
> 
> When open/write/close is used to modify files, the final close causes
> f_op->flush to be called, which flushes all dirty pages.  However if
> open/mmap/close/modify-memory/unmap is used, dirty pages can remain in
> memory after the application has dropped all references to the file.
> Similarly if a loop-mount is done with a NFS file, there is no final
> flush when the loop device is destroyed and the file can have dirty
> data after the last "close".
> 
> Also, a loop-back mount of a device does not "close" the file when the
> loop device is destroyed.  This can leave dirty page cache pages too.
> 
> Fix this by calling vfs_fsync() in nfs_file_release (aka
> f_op->release()).  This means that on the final unmap of a file (or
> destruction of a loop device), all changes are flushed, and ensures that
> when unmount is requested there will be no dirty pages to delay the
> final unmount.
> 
> Without this patch, it is not safe to stop or disconnect the NFS
> server after all clients have unmounted.  They need to unmount and
> call "sync".
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
>  fs/nfs/file.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index af330c31f627..aa883d8b24e6 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct file *filp)
>  {
>  	dprintk("NFS: release(%pD2)\n", filp);
>  
> +	if (filp->f_mode & FMODE_WRITE)
> +		/* Ensure dirty mmapped pages are flushed
> +		 * so there will be no dirty pages to
> +		 * prevent an unmount from completing.
> +		 */
> +		vfs_fsync(filp, 0);
>  	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
>  	nfs_file_clear_open_context(filp);
>  	return 0;

The right fix here is to ensure that umount() flushes the dirty data,
and that it aborts the umount attempt if the flush fails. Otherwise, a
signal to the above fsync call will cause the problem to reoccur.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com
NeilBrown Aug. 19, 2017, 1:02 a.m. UTC | #2
On Fri, Aug 18 2017, Trond Myklebust wrote:

> On Fri, 2017-08-18 at 17:12 +1000, NeilBrown wrote:
>> Any dirty NFS page holds an s_active reference on the superblock,
>> because page_private() references an nfs_page, which references an
>> open context, which references the superblock.
>> 
>> So if there are any dirty pages when the filesystem is unmounted, the
>> unmount will act like a "lazy" unmount and not call ->kill_sb().
>> Background write-back can then write out the pages *after* the
>> filesystem
>> unmount has apparently completed.
>> 
>> This contrasts with other filesystems which do not hold extra
>> s_active
>> references, so ->kill_sb() is reliably called on unmount, and
>> generic_shutdown_super() will call sync_filesystem() to flush
>> everything out before the unmount completes.
>> 
>> When open/write/close is used to modify files, the final close causes
>> f_op->flush to be called, which flushes all dirty pages.  However if
>> open/mmap/close/modify-memory/unmap is used, dirty pages can remain
>> in
>> memory after the application has dropped all references to the file.
>> Similarly if a loop-mount is done with a NFS file, there is no final
>> flush when the loop device is destroyed and the file can have dirty
>> data after the last "close".
>> 
>> Also, a loop-back mount of a device does not "close" the file when
>> the
>> loop device is destroyed.  This can leave dirty page cache pages too.
>> 
>> Fix this by calling vfs_fsync() in nfs_file_release (aka
>> f_op->release()).  This means that on the final unmap of a file (or
>> destruction of a loop device), all changes are flushed, and ensures
>> that
>> when unmount is requested there will be no dirty pages to delay the
>> final unmount.
>> 
>> Without this patch, it is not safe to stop or disconnect the NFS
>> server after all clients have unmounted.  They need to unmount and
>> call "sync".
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>>  fs/nfs/file.c |    6 ++++++
>>  1 file changed, 6 insertions(+)
>> 
>> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> index af330c31f627..aa883d8b24e6 100644
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct file
>> *filp)
>>  {
>>  	dprintk("NFS: release(%pD2)\n", filp);
>>  
>> +	if (filp->f_mode & FMODE_WRITE)
>> +		/* Ensure dirty mmapped pages are flushed
>> +		 * so there will be no dirty pages to
>> +		 * prevent an unmount from completing.
>> +		 */
>> +		vfs_fsync(filp, 0);
>>  	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
>>  	nfs_file_clear_open_context(filp);
>>  	return 0;
>
> The right fix here is to ensure that umount() flushes the dirty data,
> and that it aborts the umount attempt if the flush fails. Otherwise, a
> signal to the above fsync call will cause the problem to reoccur.

The only way to ensure that umount flushes dirty data is to revert
commit 1daef0a86837 ("NFS: Clean up nfs_sb_active/nfs_sb_deactive").
As long as we are taking extra references to s_active, umount won't do
what it ought to do.

I do wonder what we should do when you try to unmount a filesystem
mounted from an inaccessible server.
Blocking is awkward for scripts to work with.  Failing with EBUSY might
be misleading.
I don't think that using MNT_FORCE guarantees success does it?  It would
need to discard any dirty data and abort any other background tasks.

I think I would be in favor of EBUSY rather than a hang, and of making
MNT_FORCE a silver bullet providing there are no open file descriptors
on the filesystem.

Thanks,
NeilBrown
Trond Myklebust Aug. 19, 2017, 1:27 a.m. UTC | #3
On Sat, 2017-08-19 at 11:02 +1000, NeilBrown wrote:
> On Fri, Aug 18 2017, Trond Myklebust wrote:
> 
> > The right fix here is to ensure that umount() flushes the dirty data,
> > and that it aborts the umount attempt if the flush fails. Otherwise, a
> > signal to the above fsync call will cause the problem to reoccur.
> 
> The only way to ensure that umount flushes dirty data is to revert
> commit 1daef0a86837 ("NFS: Clean up nfs_sb_active/nfs_sb_deactive").
> As long as we are taking extra references to s_active, umount won't do
> what it ought to do.
> 

Reverting that patch is not sufficient. You'd still need to call
sync_filesystem(), and abort the umount if the latter fails.

> I do wonder what we should do when you try to unmount a filesystem
> mounted from an inaccessible server.
> Blocking is awkward for scripts to work with.  Failing with EBUSY might
> be misleading.
> I don't think that using MNT_FORCE guarantees success does it?  It would
> need to discard any dirty data and abort any other background tasks.

MNT_FORCE will only abort any outstanding RPC calls. It is constrained
by the fact that we don't serialise umount and mount, and that the
filesystem on which we are operating may be mounted in several places
in the main namespace and indeed in several private namespaces to boot.

> I think I would be in favor of EBUSY rather than a hang, and of making
> MNT_FORCE a silver bullet providing there are no open file descriptors
> on the filesystem.

All other filesystems trigger sync_filesystem(), and that's what the
VFS expects us to do. The problem is the VFS assumes that call will
always succeed, and won't ever return an error.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

NeilBrown Aug. 21, 2017, 1:35 a.m. UTC | #4
On Sat, Aug 19 2017, Trond Myklebust wrote:

> On Sat, 2017-08-19 at 11:02 +1000, NeilBrown wrote:
>> On Fri, Aug 18 2017, Trond Myklebust wrote:
>> 
>> > On Fri, 2017-08-18 at 17:12 +1000, NeilBrown wrote:
>> > > Any dirty NFS page holds an s_active reference on the superblock,
>> > > because page_private() references an nfs_page, which references
>> > > an
>> > > open context, which references the superblock.
>> > > 
>> > > So if there are any dirty pages when the filesystem is unmounted,
>> > > the
>> > > unmount will act like a "lazy" unmount and not call ->kill_sb().
>> > > Background write-back can then write out the pages *after* the
>> > > filesystem
>> > > unmount has apparently completed.
>> > > 
>> > > This contrasts with other filesystems which do not hold extra
>> > > s_active
>> > > references, so ->kill_sb() is reliably called on unmount, and
>> > > generic_shutdown_super() will call sync_filesystem() to flush
>> > > everything out before the unmount completes.
>> > > 
>> > > When open/write/close is used to modify files, the final close
>> > > causes
>> > > f_op->flush to be called, which flushes all dirty pages.  However
>> > > if
>> > > open/mmap/close/modify-memory/unmap is used, dirty pages can
>> > > remain
>> > > in
>> > > memory after the application has dropped all references to the
>> > > file.
>> > > Similarly if a loop-mount is done with a NFS file, there is no
>> > > final
>> > > flush when the loop device is destroyed and the file can have
>> > > dirty
>> > > data after the last "close".
>> > > 
>> > > Also, a loop-back mount of a device does not "close" the file
>> > > when
>> > > the
>> > > loop device is destroyed.  This can leave dirty page cache pages
>> > > too.
>> > > 
>> > > Fix this by calling vfs_fsync() in nfs_file_release (aka
>> > > f_op->release()).  This means that on the final unmap of a file
>> > > (or
>> > > destruction of a loop device), all changes are flushed, and
>> > > ensures
>> > > that
>> > > when unmount is requested there will be no dirty pages to delay
>> > > the
>> > > final unmount.
>> > > 
>> > > Without this patch, it is not safe to stop or disconnect the NFS
>> > > server after all clients have unmounted.  They need to unmount
>> > > and
>> > > call "sync".
>> > > 
>> > > Signed-off-by: NeilBrown <neilb@suse.com>
>> > > ---
>> > >  fs/nfs/file.c |    6 ++++++
>> > >  1 file changed, 6 insertions(+)
>> > > 
>> > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> > > index af330c31f627..aa883d8b24e6 100644
>> > > --- a/fs/nfs/file.c
>> > > +++ b/fs/nfs/file.c
>> > > @@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct
>> > > file
>> > > *filp)
>> > >  {
>> > >  	dprintk("NFS: release(%pD2)\n", filp);
>> > >  
>> > > +	if (filp->f_mode & FMODE_WRITE)
>> > > +		/* Ensure dirty mmapped pages are flushed
>> > > +		 * so there will be no dirty pages to
>> > > +		 * prevent an unmount from completing.
>> > > +		 */
>> > > +		vfs_fsync(filp, 0);
>> > >  	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
>> > >  	nfs_file_clear_open_context(filp);
>> > >  	return 0;
>> > 
>> > The right fix here is to ensure that umount() flushes the dirty
>> > data,
>> > and that it aborts the umount attempt if the flush fails.
>> > Otherwise, a
>> > signal to the above fsync call will cause the problem to reoccur.
>> 
>> The only way to ensure that umount flushes dirty data is to revert
>> commit 1daef0a86837 ("NFS: Clean up nfs_sb_active/nfs_sb_deactive").
>> As long as we are taking extra references to s_active, umount won't
>> do
>> what it ought to do.
>> 
>
> Reverting that patch is not sufficient. You'd still need to call
> sync_filesystem(), and abort the umount if the latter fails.

generic_shutdown_super() (which nfs_kill_super() calls) already calls
sync_filesystem().  However sync_filesystem() cannot fail - it just
asks bdi workers to flush data out, then waits for completion.

If the server is working, this should complete correctly.  If it isn't
this will block indefinitely in the same way that a "sync" following the
umount will block indefinitely today.
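
For reference, a much-simplified sketch of that path (condensed from
fs/super.c of this era; details and locking omitted).  The point is that the
result of sync_filesystem() is never checked, so there is nothing here that
could abort the umount:

void generic_shutdown_super(struct super_block *sb)
{
	if (sb->s_root) {
		shrink_dcache_for_umount(sb);
		sync_filesystem(sb);		/* return value ignored */
		sb->s_flags &= ~MS_ACTIVE;
		evict_inodes(sb);
		if (sb->s_op->put_super)
			sb->s_op->put_super(sb);
	}
	/* ... remove sb from the super_blocks list and release it ... */
}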

So while I agree that reverting this isn't sufficient, I think it might
be a step in the right direction.

I've been experimenting with what happens when the server goes down and
you try to unmount.  With NFSv4, the state manager can still be trying
to send a RENEW and not want to let go until it gets an answer.  I
haven't traced it all through yet to understand why or how significant
this is.


>
>> I do wonder what we should do when you try to unmount a filesystem
>> mounted from an inaccessible server.
>> Blocking is awkward for scripts to work with.  Failing with EBUSY
>> might
>> be misleading.
>> I don't think that using MNT_FORCE guarantees success does it?  It
>> would
>> need to discard any dirty data and abort any other background tasks.
>
> MNT_FORCE will only abort any outstanding RPC calls. It is constrained
> by the fact that we don't serialise umount and mount, and that the
> filesystem on which we are operating may be mounted in several places
> in the main namespace and indeed in several private namespaces to
> boot.

Yes, that is a problem.
If we could arrange that SIGKILL *always* kills the process so that it
drops the reference to the filesystem, we could possibly delay the
MNT_FORCE handling until deactivate_super(), but at present we sometimes
need MNT_FORCE to allow KILLed processes to die, and while there are
processes with file descriptors, umount will not get as far as
deactivate_super().
(We need filemap_fdatawait_range() to be killable to do better.)

>
>> I think I would be in favor of EBUSY rather than a hang, and of
>> making
>> MNT_FORCE a silver bullet providing there are no open file
>> descriptors
>> on the filesystem.
>
> All other filesystems trigger sync_filesystem(), and that's what the
> VFS expects us to do. The problem is the VFS assumes that call will
> always succeed, and won't ever return an error.

Yes, it is not easy to find an "ideal" solution.

Given that fact, and given that the present patch does address an easily
demonstrated symptom, would you accept it as a small step in a good
direction?  When the server is still accessible, and when no-one kills
processes at awkward times, it definitely helps.

Even with a SIGKILL, I think the WRITEs and the COMMIT will be sent, but
the COMMIT won't be waited for...

Thanks,
NeilBrown
Jeff Layton Aug. 23, 2017, 11:12 a.m. UTC | #5
On Mon, 2017-08-21 at 11:35 +1000, NeilBrown wrote:
> On Sat, Aug 19 2017, Trond Myklebust wrote:
> 
> > On Sat, 2017-08-19 at 11:02 +1000, NeilBrown wrote:
> > > On Fri, Aug 18 2017, Trond Myklebust wrote:
> > > 
> > > > On Fri, 2017-08-18 at 17:12 +1000, NeilBrown wrote:
> > > > > Any dirty NFS page holds an s_active reference on the superblock,
> > > > > because page_private() references an nfs_page, which references
> > > > > an
> > > > > open context, which references the superblock.
> > > > > 
> > > > > So if there are any dirty pages when the filesystem is unmounted,
> > > > > the
> > > > > unmount will act like a "lazy" unmount and not call ->kill_sb().
> > > > > Background write-back can then write out the pages *after* the
> > > > > filesystem
> > > > > unmount has apparently completed.
> > > > > 
> > > > > This contrasts with other filesystems which do not hold extra
> > > > > s_active
> > > > > references, so ->kill_sb() is reliably called on unmount, and
> > > > > generic_shutdown_super() will call sync_filesystem() to flush
> > > > > everything out before the unmount completes.
> > > > > 
> > > > > When open/write/close is used to modify files, the final close
> > > > > causes
> > > > > f_op->flush to be called, which flushes all dirty pages.  However
> > > > > if
> > > > > open/mmap/close/modify-memory/unmap is used, dirty pages can
> > > > > remain
> > > > > in
> > > > > memory after the application has dropped all references to the
> > > > > file.
> > > > > Similarly if a loop-mount is done with a NFS file, there is no
> > > > > final
> > > > > flush when the loop device is destroyed and the file can have
> > > > > dirty
> > > > > data after the last "close".
> > > > > 
> > > > > Also, a loop-back mount of a device does not "close" the file
> > > > > when
> > > > > the
> > > > > loop device is destroyed.  This can leave dirty page cache pages
> > > > > too.
> > > > > 
> > > > > Fix this by calling vfs_fsync() in nfs_file_release (aka
> > > > > f_op->release()).  This means that on the final unmap of a file
> > > > > (or
> > > > > destruction of a loop device), all changes are flushed, and
> > > > > ensures
> > > > > that
> > > > > when unmount is requested there will be no dirty pages to delay
> > > > > the
> > > > > final unmount.
> > > > > 
> > > > > Without this patch, it is not safe to stop or disconnect the NFS
> > > > > server after all clients have unmounted.  They need to unmount
> > > > > and
> > > > > call "sync".
> > > > > 
> > > > > Signed-off-by: NeilBrown <neilb@suse.com>
> > > > > ---
> > > > >  fs/nfs/file.c |    6 ++++++
> > > > >  1 file changed, 6 insertions(+)
> > > > > 
> > > > > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > > > > index af330c31f627..aa883d8b24e6 100644
> > > > > --- a/fs/nfs/file.c
> > > > > +++ b/fs/nfs/file.c
> > > > > @@ -81,6 +81,12 @@ nfs_file_release(struct inode *inode, struct
> > > > > file
> > > > > *filp)
> > > > >  {
> > > > >  	dprintk("NFS: release(%pD2)\n", filp);
> > > > >  
> > > > > +	if (filp->f_mode & FMODE_WRITE)
> > > > > +		/* Ensure dirty mmapped pages are flushed
> > > > > +		 * so there will be no dirty pages to
> > > > > +		 * prevent an unmount from completing.
> > > > > +		 */
> > > > > +		vfs_fsync(filp, 0);
> > > > >  	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
> > > > >  	nfs_file_clear_open_context(filp);
> > > > >  	return 0;
> > > > 
> > > > The right fix here is to ensure that umount() flushes the dirty
> > > > data,
> > > > and that it aborts the umount attempt if the flush fails.
> > > > Otherwise, a
> > > > signal to the above fsync call will cause the problem to reoccur.
> > > 
> > > The only way to ensure that umount flushes dirty data is to revert
> > > commit 1daef0a86837 ("NFS: Clean up nfs_sb_active/nfs_sb_deactive").
> > > As long as we are taking extra references to s_active, umount won't
> > > do
> > > what it ought to do.
> > > 
> > 
> > Reverting that patch is not sufficient. You'd still need to call
> > sync_filesystem(), and abort the umount if the latter fails.
> 
> generic_shutdown_super() (which nfs_kill_super() calls) already calls
> sync_filesystem().  However sync_filesystem() cannot fail - it just
> asks bdi workers to flush data out, then waits for completion.
> 
> If the server is working, this should complete correctly.  If it isn't
> this will block indefinitely in the same way that a "sync" following the
> umount will block indefinitely today.
> 
> So while I agree that reverting this isn't sufficient, I think it might
> be a step in the right direction.
> 
> I've been experimenting with what happens when the server goes down and
> you try to unmount.  With NFSv4, the state manager can still be trying
> to send a RENEW and not want to let go until it gets an answer.  I
> haven't traced it all through yet to understand why or how significant
> this is.
> 
> 
> > 
> > > I do wonder what we should do when you try to unmount a filesystem
> > > mounted from an inaccessible server.
> > > Blocking is awkward for scripts to work with.  Failing with EBUSY
> > > might
> > > be misleading.
> > > I don't think that using MNT_FORCE guarantees success does it?  It
> > > would
> > > need to discard any dirty data and abort any other background tasks.
> > 
> > MNT_FORCE will only abort any outstanding RPC calls. It is constrained
> > by the fact that we don't serialise umount and mount, and that the
> > filesystem on which we are operating may be mounted in several places
> > in the main namespace and indeed in several private namespaces to
> > boot.
> 
> Yes, that is a problem.
> If we could arrange that SIGKILL *always* kills the process so that it
> drops the reference to the filesystem, we could possibly delay the
> MNT_FORCE handling until deactivate_super(), but at present we sometimes
> need MNT_FORCE to allow KILLed processes to die, and while there are
> processes with file descriptors, umount will not get as far as
> deactivate_super().
> (We need filemap_fdatawait_range() to be killable to do better.)
> 

Indeed. I think we need to consider variants of these that are killable,
and replace the existing calls with the killable variants in the
appropriate places. That's a long term (and difficult) project though...
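
To make that concrete, something along these lines is what a killable
variant might look like (purely a sketch: the function name is made up, and
whether wait_on_page_bit_killable() is the right primitive here is an open
question):

static int filemap_fdatawait_range_killable(struct address_space *mapping,
					    loff_t start_byte, loff_t end_byte)
{
	pgoff_t index = start_byte >> PAGE_SHIFT;
	pgoff_t end = end_byte >> PAGE_SHIFT;
	struct pagevec pvec;
	int ret = 0;

	pagevec_init(&pvec, 0);
	while (index <= end) {
		unsigned i, nr;

		nr = pagevec_lookup_tag(&pvec, mapping, &index,
					PAGECACHE_TAG_WRITEBACK,
					min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1);
		if (!nr)
			break;
		for (i = 0; i < nr; i++) {
			/* returns -EINTR if a fatal signal is pending */
			ret = wait_on_page_bit_killable(pvec.pages[i],
							PG_writeback);
			if (ret)
				break;
		}
		pagevec_release(&pvec);
		cond_resched();
		if (ret)
			break;
	}
	return ret;
}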

> > 
> > > I think I would be in favor of EBUSY rather than a hang, and of
> > > making
> > > MNT_FORCE a silver bullet providing there are no open file
> > > descriptors
> > > on the filesystem.
> > 
> > All other filesystems trigger sync_filesystem(), and that's what the
> > VFS expects us to do. The problem is the VFS assumes that call will
> > always succeed, and won't ever return an error.
> 
> Yes, it is not easy to find an "ideal" solution.
> 
> Given that fact, and given that the present patch does address an easily
> demonstrated symptom, would you accept it as a small step in a good
> direction?  When the server is still accessible, and when no-one kills
> processes at awkward times, it definitely helps.
> 
> Even with a SIGKILL, I think the WRITEs and the COMMIT will be sent, but
> the COMMIT won't be waited for...
> 
> Thanks,
> NeilBrown

Patch

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index af330c31f627..aa883d8b24e6 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -81,6 +81,12 @@  nfs_file_release(struct inode *inode, struct file *filp)
 {
 	dprintk("NFS: release(%pD2)\n", filp);
 
+	if (filp->f_mode & FMODE_WRITE)
+		/* Ensure dirty mmapped pages are flushed
+		 * so there will be no dirty pages to
+		 * prevent an unmount from completing.
+		 */
+		vfs_fsync(filp, 0);
 	nfs_inc_stats(inode, NFSIOS_VFSRELEASE);
 	nfs_file_clear_open_context(filp);
 	return 0;
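
For context, vfs_fsync(filp, 0) in the hunk above boils down to the
following (simplified from fs/sync.c), i.e. a whole-file, non-datasync fsync
that for NFS regular files ends up in nfs_file_fsync():

int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
	if (!file->f_op->fsync)
		return -EINVAL;
	return file->f_op->fsync(file, start, end, datasync);
}

int vfs_fsync(struct file *file, int datasync)
{
	return vfs_fsync_range(file, 0, LLONG_MAX, datasync);
}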