[v4,24/28] NFS: Getattr doesn't require data sync semantics

Message ID	1468817945.5273.2.camel@primarydata.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nfs-owner@kernel.org> From: Trond Myklebust <trondmy@primarydata.com> To: "hch@infradead.org" <hch@infradead.org> CC: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org> Subject: Re: [PATCH v4 24/28] NFS: Getattr doesn't require data sync semantics Thread-Topic: [PATCH v4 24/28] NFS: Getattr doesn't require data sync semantics Thread-Index: AQHR19YZFjW5RsNOqEq6RmGV96Bht6AdnueAgAAMUQCAAAdTgA== Date: Mon, 18 Jul 2016 04:59:09 +0000 Message-ID: <1468817945.5273.2.camel@primarydata.com> References: <1467844205-76852-16-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-17-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-18-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-19-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-20-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-21-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-22-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-23-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-24-git-send-email-trond.myklebust@primarydata.com> <1467844205-76852-25-git-send-email-trond.myklebust@primarydata.com> <20160718034847.GA1195@infradead.org> <A9C18137-87A3-4652-ADD6-E8E1C4BAE27B@primarydata.com> In-Reply-To: <A9C18137-87A3-4652-ADD6-E8E1C4BAE27B@primarydata.com> Accept-Language: en-US Content-Language: en-US spamdiagnosticoutput: 1:99 spamdiagnosticmetadata: NSPM Content-ID: <8FEAD848A2E95946832BAD9C3CBC8EB9@namprd11.prod.outlook.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: base64 Sender: linux-nfs-owner@vger.kernel.org Precedence: bulk

Trond Myklebust July 18, 2016, 4:59 a.m. UTC

On Mon, 2016-07-18 at 00:32 -0400, Trond Myklebust wrote:
> Hi Christoph,

> 

> > 

> > On Jul 17, 2016, at 23:48, Christoph Hellwig <hch@infradead.org>

> > wrote:

> > 

> > On Wed, Jul 06, 2016 at 06:30:01PM -0400, Trond Myklebust wrote:

> > > 

> > > When retrieving stat() information, NFS unfortunately does

> > > require us to

> > > sync writes to disk in order to ensure that mtime and ctime are

> > > up to

> > > date. However we shouldn't have to ensure that those writes are

> > > persisted.

> > > 

> > > Relaxing that requirement does mean that we may see an

> > > mtime/ctime change

> > > if the server reboots and forces us to replay all writes.

> > > 

> > > The exception to this rule are pNFS clients that are required to

> > > send

> > > layoutcommit, however that is dealt with by the call to

> > > pnfs_sync_inode()

> > > in _nfs_revalidate_inode().

> > 

> > This one breaks xfstests generic/207 on block/scsi layout for

> > me.  The

> > reason for that is that we need a layoutcommit after writing out

> > all

> > data for the file for the file size to be updated on the server.

> > 

> > Below is my attempt to fix this by re-adding pnfs_sync_inode to

> > nfs_getattr.  The call in _nfs_revalidate_inode isn't enough as it

> > doesn't get called in most cases we care about.

> > 

> 

> I’m not understanding this argument. Why do we care if the file size

> is up to date on the server if we’re not sending an actual GETATTR on

> the wire to retrieve the file size?

> 

> Cheers

>   Trond


Actually... The problem might be that a previous attribute update is
marking the attribute cache as being revalidated. Does the following
patch help?

Cheers
  Trond

8<-----------------------------------------------------------
From 10b7e9ad44881fcd46ac24eb7374377c6e8962ed Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust@primarydata.com>

Date: Mon, 18 Jul 2016 00:51:01 -0400
Subject: [PATCH] pNFS: Don't mark the inode as revalidated if a LAYOUTCOMMIT
 is outstanding

We know that the attributes will need updating if there is still a
LAYOUTCOMMIT outstanding.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

---
 fs/nfs/inode.c | 5 ++++-
 fs/nfs/pnfs.h  | 7 +++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

-- 
2.7.4


  
  




Trond Myklebust
Principal System Architect
4300 El Camino Real | Suite 100
Los Altos, CA  94022
W: 650-422-3800
C: 801-921-4583 
www.primarydata.com

Christoph Hellwig July 19, 2016, 3:58 a.m. UTC | #1

On Mon, Jul 18, 2016 at 04:59:09AM +0000, Trond Myklebust wrote:
> Actually... The problem might be that a previous attribute update is
> marking the attribute cache as being revalidated. Does the following
> patch help?

It doesn't.  Also with your most recent linux-next branch the test
now cause the systems to OOM with or without your patch (with mine it's
still fine).  I tested with your writeback branch from about two or
three days ago before, and with that + your patch it also 'just fails'
and doesn't OOM.  Looks like whatever causes the bug also creates
a temporarily memory leak when combined with recent changes from your
tree, most likely something from the pnfs branch.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 19, 2016, 8 p.m. UTC | #2

On 18 Jul 2016, at 23:58, hch@infradead.org wrote:

> On Mon, Jul 18, 2016 at 04:59:09AM +0000, Trond Myklebust wrote:
>> Actually... The problem might be that a previous attribute update is
>> marking the attribute cache as being revalidated. Does the following
>> patch help?
>
> It doesn't.  Also with your most recent linux-next branch the test
> now cause the systems to OOM with or without your patch (with mine 
> it's
> still fine).  I tested with your writeback branch from about two or
> three days ago before, and with that + your patch it also 'just fails'
> and doesn't OOM.  Looks like whatever causes the bug also creates
> a temporarily memory leak when combined with recent changes from your
> tree, most likely something from the pnfs branch.

I couldn't find the memory leak using kmemleak, but it OOMs pretty 
quick.  If I
insert an mdelay(200) just after the lookup_again: marker in
pnfs_update_layout() it doesn't OOM, but it seems stuck forever in a 
loop on
that marker:

[ 1230.635586] pnfs_find_alloc_layout Begin ino=ffff88003ef986f8 
layout=ffff8800392bca58
[ 1230.636729] pnfs_find_lseg:Begin
[ 1230.637538] pnfs_find_lseg:Return lseg           (null) ref 0
[ 1230.638582] --> send_layoutget
[ 1230.639499] --> nfs4_proc_layoutget
[ 1230.640525] --> nfs4_layoutget_prepare
[ 1230.641479] --> nfs41_setup_sequence
[ 1230.641581] <-- nfs4_proc_layoutget status=-512
[ 1230.643288] --> nfs4_alloc_slot used_slots=0000 
highest_used=4294967295 max_slots=31
[ 1230.644348] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 
slotid=0
[ 1230.645373] <-- nfs41_setup_sequence slotid=0 seqid=4376
[ 1230.646356] <-- nfs4_layoutget_prepare
[ 1230.647357] encode_sequence: sessionid=1468956665:2:3:0 seqid=4376 
slotid=0 max_slotid=0 cache_this=0
[ 1230.648522] encode_layoutget: 1st type:0x5 iomode:2 off:122880 
len:4096 mc:4096
[ 1230.650182] decode_layoutget roff:122880 rlen:4096 riomode:2, 
lo_type:0x5, lo.len:48
[ 1230.651331] --> nfs4_layoutget_done
[ 1230.652233] --> nfs4_alloc_slot used_slots=0001 highest_used=0 
max_slots=31
[ 1230.653409] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 
slotid=1
[ 1230.654547] nfs4_free_slot: slotid 1 highest_used_slotid 0
[ 1230.655606] nfs41_sequence_done: Error 0 free the slot
[ 1230.656635] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
[ 1230.657739] <-- nfs4_layoutget_done
[ 1230.658650] --> nfs4_layoutget_release
[ 1230.659626] <-- nfs4_layoutget_release

This debug output is identical for every cycle of the loop. Have to stop 
for the
day.. more tomorrow.

Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Trond Myklebust July 19, 2016, 8:06 p.m. UTC | #3

> On Jul 19, 2016, at 16:00, Benjamin Coddington <bcodding@redhat.com> wrote:
> 
> On 18 Jul 2016, at 23:58, hch@infradead.org wrote:
> 
>> On Mon, Jul 18, 2016 at 04:59:09AM +0000, Trond Myklebust wrote:
>>> Actually... The problem might be that a previous attribute update is
>>> marking the attribute cache as being revalidated. Does the following
>>> patch help?
>> 
>> It doesn't.  Also with your most recent linux-next branch the test
>> now cause the systems to OOM with or without your patch (with mine it's
>> still fine).  I tested with your writeback branch from about two or
>> three days ago before, and with that + your patch it also 'just fails'
>> and doesn't OOM.  Looks like whatever causes the bug also creates
>> a temporarily memory leak when combined with recent changes from your
>> tree, most likely something from the pnfs branch.
> 
> I couldn't find the memory leak using kmemleak, but it OOMs pretty quick.  If I
> insert an mdelay(200) just after the lookup_again: marker in
> pnfs_update_layout() it doesn't OOM, but it seems stuck forever in a loop on
> that marker:
> 
> [ 1230.635586] pnfs_find_alloc_layout Begin ino=ffff88003ef986f8 layout=ffff8800392bca58
> [ 1230.636729] pnfs_find_lseg:Begin
> [ 1230.637538] pnfs_find_lseg:Return lseg           (null) ref 0
> [ 1230.638582] --> send_layoutget
> [ 1230.639499] --> nfs4_proc_layoutget
> [ 1230.640525] --> nfs4_layoutget_prepare
> [ 1230.641479] --> nfs41_setup_sequence
> [ 1230.641581] <-- nfs4_proc_layoutget status=-512
> [ 1230.643288] --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=31
> [ 1230.644348] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
> [ 1230.645373] <-- nfs41_setup_sequence slotid=0 seqid=4376
> [ 1230.646356] <-- nfs4_layoutget_prepare
> [ 1230.647357] encode_sequence: sessionid=1468956665:2:3:0 seqid=4376 slotid=0 max_slotid=0 cache_this=0
> [ 1230.648522] encode_layoutget: 1st type:0x5 iomode:2 off:122880 len:4096 mc:4096
> [ 1230.650182] decode_layoutget roff:122880 rlen:4096 riomode:2, lo_type:0x5, lo.len:48
> [ 1230.651331] --> nfs4_layoutget_done
> [ 1230.652233] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=31
> [ 1230.653409] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
> [ 1230.654547] nfs4_free_slot: slotid 1 highest_used_slotid 0
> [ 1230.655606] nfs41_sequence_done: Error 0 free the slot
> [ 1230.656635] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> [ 1230.657739] <-- nfs4_layoutget_done
> [ 1230.658650] --> nfs4_layoutget_release
> [ 1230.659626] <-- nfs4_layoutget_release
> 
> This debug output is identical for every cycle of the loop. Have to stop for the
> day.. more tomorrow.
> 
> Ben
> 

Duh… It’s this patch: pNFS: Fix post-layoutget error handling in pnfs_update_layout()
We have to pass through fatal errors… I’ll fix it.

Cheers
  Trond

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 19, 2016, 8:09 p.m. UTC | #4

On 19 Jul 2016, at 16:00, Benjamin Coddington wrote:

> On 18 Jul 2016, at 23:58, hch@infradead.org wrote:
>
>> On Mon, Jul 18, 2016 at 04:59:09AM +0000, Trond Myklebust wrote:
>>> Actually... The problem might be that a previous attribute update is
>>> marking the attribute cache as being revalidated. Does the following
>>> patch help?
>>
>> It doesn't.  Also with your most recent linux-next branch the test
>> now cause the systems to OOM with or without your patch (with mine 
>> it's
>> still fine).  I tested with your writeback branch from about two or
>> three days ago before, and with that + your patch it also 'just 
>> fails'
>> and doesn't OOM.  Looks like whatever causes the bug also creates
>> a temporarily memory leak when combined with recent changes from your
>> tree, most likely something from the pnfs branch.
>
> I couldn't find the memory leak using kmemleak, but it OOMs pretty 
> quick.  If I
> insert an mdelay(200) just after the lookup_again: marker in
> pnfs_update_layout() it doesn't OOM, but it seems stuck forever in a 
> loop on
> that marker:
>
> [ 1230.635586] pnfs_find_alloc_layout Begin ino=ffff88003ef986f8 
> layout=ffff8800392bca58
> [ 1230.636729] pnfs_find_lseg:Begin
> [ 1230.637538] pnfs_find_lseg:Return lseg           (null) ref 0
> [ 1230.638582] --> send_layoutget
> [ 1230.639499] --> nfs4_proc_layoutget
> [ 1230.640525] --> nfs4_layoutget_prepare
> [ 1230.641479] --> nfs41_setup_sequence
> [ 1230.641581] <-- nfs4_proc_layoutget status=-512
> [ 1230.643288] --> nfs4_alloc_slot used_slots=0000 
> highest_used=4294967295 max_slots=31
> [ 1230.644348] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 
> slotid=0
> [ 1230.645373] <-- nfs41_setup_sequence slotid=0 seqid=4376
> [ 1230.646356] <-- nfs4_layoutget_prepare
> [ 1230.647357] encode_sequence: sessionid=1468956665:2:3:0 seqid=4376 
> slotid=0 max_slotid=0 cache_this=0
> [ 1230.648522] encode_layoutget: 1st type:0x5 iomode:2 off:122880 
> len:4096 mc:4096
> [ 1230.650182] decode_layoutget roff:122880 rlen:4096 riomode:2, 
> lo_type:0x5, lo.len:48
> [ 1230.651331] --> nfs4_layoutget_done
> [ 1230.652233] --> nfs4_alloc_slot used_slots=0001 highest_used=0 
> max_slots=31
> [ 1230.653409] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 
> slotid=1
> [ 1230.654547] nfs4_free_slot: slotid 1 highest_used_slotid 0
> [ 1230.655606] nfs41_sequence_done: Error 0 free the slot
> [ 1230.656635] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> [ 1230.657739] <-- nfs4_layoutget_done
> [ 1230.658650] --> nfs4_layoutget_release
> [ 1230.659626] <-- nfs4_layoutget_release
>
> This debug output is identical for every cycle of the loop.

Except for the monotonically incrementing sequence id!  sorry..  :/

Ben

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 20, 2016, 3:03 p.m. UTC | #5

On 19 Jul 2016, at 16:06, Trond Myklebust wrote:

>> On Jul 19, 2016, at 16:00, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>>
>> On 18 Jul 2016, at 23:58, hch@infradead.org wrote:
>>
>>> On Mon, Jul 18, 2016 at 04:59:09AM +0000, Trond Myklebust wrote:
>>>> Actually... The problem might be that a previous attribute update 
>>>> is
>>>> marking the attribute cache as being revalidated. Does the 
>>>> following
>>>> patch help?
>>>
>>> It doesn't.  Also with your most recent linux-next branch the test
>>> now cause the systems to OOM with or without your patch (with mine 
>>> it's
>>> still fine).  I tested with your writeback branch from about two or
>>> three days ago before, and with that + your patch it also 'just 
>>> fails'
>>> and doesn't OOM.  Looks like whatever causes the bug also creates
>>> a temporarily memory leak when combined with recent changes from 
>>> your
>>> tree, most likely something from the pnfs branch.
>>
>> I couldn't find the memory leak using kmemleak, but it OOMs pretty 
>> quick.  If I
>> insert an mdelay(200) just after the lookup_again: marker in
>> pnfs_update_layout() it doesn't OOM, but it seems stuck forever in a 
>> loop on
>> that marker:
>>
>> [ 1230.635586] pnfs_find_alloc_layout Begin ino=ffff88003ef986f8 
>> layout=ffff8800392bca58
>> [ 1230.636729] pnfs_find_lseg:Begin
>> [ 1230.637538] pnfs_find_lseg:Return lseg           (null) ref 0
>> [ 1230.638582] --> send_layoutget
>> [ 1230.639499] --> nfs4_proc_layoutget
>> [ 1230.640525] --> nfs4_layoutget_prepare
>> [ 1230.641479] --> nfs41_setup_sequence
>> [ 1230.641581] <-- nfs4_proc_layoutget status=-512
>> [ 1230.643288] --> nfs4_alloc_slot used_slots=0000 
>> highest_used=4294967295 max_slots=31
>> [ 1230.644348] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 
>> slotid=0
>> [ 1230.645373] <-- nfs41_setup_sequence slotid=0 seqid=4376
>> [ 1230.646356] <-- nfs4_layoutget_prepare
>> [ 1230.647357] encode_sequence: sessionid=1468956665:2:3:0 seqid=4376 
>> slotid=0 max_slotid=0 cache_this=0
>> [ 1230.648522] encode_layoutget: 1st type:0x5 iomode:2 off:122880 
>> len:4096 mc:4096
>> [ 1230.650182] decode_layoutget roff:122880 rlen:4096 riomode:2, 
>> lo_type:0x5, lo.len:48
>> [ 1230.651331] --> nfs4_layoutget_done
>> [ 1230.652233] --> nfs4_alloc_slot used_slots=0001 highest_used=0 
>> max_slots=31
>> [ 1230.653409] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 
>> slotid=1
>> [ 1230.654547] nfs4_free_slot: slotid 1 highest_used_slotid 0
>> [ 1230.655606] nfs41_sequence_done: Error 0 free the slot
>> [ 1230.656635] nfs4_free_slot: slotid 0 highest_used_slotid 
>> 4294967295
>> [ 1230.657739] <-- nfs4_layoutget_done
>> [ 1230.658650] --> nfs4_layoutget_release
>> [ 1230.659626] <-- nfs4_layoutget_release
>>
>> This debug output is identical for every cycle of the loop. Have to 
>> stop for the
>> day.. more tomorrow.
>>
>> Ben
>>
>
> Duh… It’s this patch: pNFS: Fix post-layoutget error handling in 
> pnfs_update_layout()
> We have to pass through fatal errors… I’ll fix it.

That's indeed fixed it up, and generic/207 passes now.  Thanks!

Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig July 21, 2016, 8:22 a.m. UTC | #6

On Wed, Jul 20, 2016 at 11:03:06AM -0400, Benjamin Coddington wrote:
> > > This debug output is identical for every cycle of the loop. Have to
> > > stop for the
> > > day.. more tomorrow.
> > > 
> > > Ben
> > > 
> > 
> > Duh??? It???s this patch: pNFS: Fix post-layoutget error handling in
> > pnfs_update_layout()
> > We have to pass through fatal errors??? I???ll fix it.
> 
> That's indeed fixed it up, and generic/207 passes now.  Thanks!

So I spoke too soon in my last mail, generic/207 still fails for me
with Trond's linux-next tree, although much later in the test now.

Does it include the changes that are supposed to fix the issue?
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 21, 2016, 8:32 a.m. UTC | #7

On 21 Jul 2016, at 4:22, hch@infradead.org wrote:

> On Wed, Jul 20, 2016 at 11:03:06AM -0400, Benjamin Coddington wrote:
>>>> This debug output is identical for every cycle of the loop. Have to
>>>> stop for the
>>>> day.. more tomorrow.
>>>>
>>>> Ben
>>>>
>>>
>>> Duh??? It???s this patch: pNFS: Fix post-layoutget error handling in
>>> pnfs_update_layout()
>>> We have to pass through fatal errors??? I???ll fix it.
>>
>> That's indeed fixed it up, and generic/207 passes now.  Thanks!
>
> So I spoke too soon in my last mail, generic/207 still fails for me
> with Trond's linux-next tree, although much later in the test now.
>
> Does it include the changes that are supposed to fix the issue?

It should -- the v2 that fixed 207 for me is 
56b38a1f7c781519eef09c1668a3c97ea911f86b, the first version was 
e35c2a0b3cd062a8941d21511719391b64437427, I think.  I'll test again too.

Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 21, 2016, 9:10 a.m. UTC | #8

On 21 Jul 2016, at 4:32, Benjamin Coddington wrote:

> On 21 Jul 2016, at 4:22, hch@infradead.org wrote:
>
>> On Wed, Jul 20, 2016 at 11:03:06AM -0400, Benjamin Coddington wrote:
>>>>> This debug output is identical for every cycle of the loop. Have 
>>>>> to
>>>>> stop for the
>>>>> day.. more tomorrow.
>>>>>
>>>>> Ben
>>>>>
>>>>
>>>> Duh??? It???s this patch: pNFS: Fix post-layoutget error handling 
>>>> in
>>>> pnfs_update_layout()
>>>> We have to pass through fatal errors??? I???ll fix it.
>>>
>>> That's indeed fixed it up, and generic/207 passes now.  Thanks!
>>
>> So I spoke too soon in my last mail, generic/207 still fails for me
>> with Trond's linux-next tree, although much later in the test now.
>>
>> Does it include the changes that are supposed to fix the issue?
>
> It should -- the v2 that fixed 207 for me is 
> 56b38a1f7c781519eef09c1668a3c97ea911f86b, the first version was 
> e35c2a0b3cd062a8941d21511719391b64437427, I think.  I'll test again 
> too.

Looks like we're back to the original problem - it fails with the inode 
size is 4k less than expected.

The reason it worked for me was I had pnfs_ld debugging turned up which 
slowed things down enough to somehow catch the right size.

Looks like the right size is returned in the CLOSE, but the inode's not 
getting updated.

Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 21, 2016, 9:52 a.m. UTC | #9

On 21 Jul 2016, at 5:10, Benjamin Coddington wrote:

> On 21 Jul 2016, at 4:32, Benjamin Coddington wrote:
>
>> On 21 Jul 2016, at 4:22, hch@infradead.org wrote:
>>
>>> On Wed, Jul 20, 2016 at 11:03:06AM -0400, Benjamin Coddington wrote:
>>>>>> This debug output is identical for every cycle of the loop. Have 
>>>>>> to
>>>>>> stop for the
>>>>>> day.. more tomorrow.
>>>>>>
>>>>>> Ben
>>>>>>
>>>>>
>>>>> Duh??? It???s this patch: pNFS: Fix post-layoutget error handling 
>>>>> in
>>>>> pnfs_update_layout()
>>>>> We have to pass through fatal errors??? I???ll fix it.
>>>>
>>>> That's indeed fixed it up, and generic/207 passes now.  Thanks!
>>>
>>> So I spoke too soon in my last mail, generic/207 still fails for me
>>> with Trond's linux-next tree, although much later in the test now.
>>>
>>> Does it include the changes that are supposed to fix the issue?
>>
>> It should -- the v2 that fixed 207 for me is 
>> 56b38a1f7c781519eef09c1668a3c97ea911f86b, the first version was 
>> e35c2a0b3cd062a8941d21511719391b64437427, I think.  I'll test again 
>> too.
>
> Looks like we're back to the original problem - it fails with the 
> inode size is 4k less than expected.
>
> The reason it worked for me was I had pnfs_ld debugging turned up 
> which slowed things down enough to somehow catch the right size.
>
> Looks like the right size is returned in the CLOSE, but the inode's 
> not getting updated.

And the size is right in the last LAYOUTCOMMIT response, of course.  Is 
this the problem?

6024 static int decode_layoutcommit(struct xdr_stream *xdr,
...
6040     sizechanged = be32_to_cpup(p);
6041
6042     if (sizechanged) {
6043         /* throw away new size */
6044         p = xdr_inline_decode(xdr, 8);


Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Trond Myklebust July 21, 2016, 12:46 p.m. UTC | #10

> On Jul 21, 2016, at 05:52, Benjamin Coddington <bcodding@redhat.com> wrote:
> 
> On 21 Jul 2016, at 5:10, Benjamin Coddington wrote:
> 
>> On 21 Jul 2016, at 4:32, Benjamin Coddington wrote:
>> 
>>> On 21 Jul 2016, at 4:22, hch@infradead.org wrote:
>>> 
>>>> On Wed, Jul 20, 2016 at 11:03:06AM -0400, Benjamin Coddington wrote:
>>>>>>> This debug output is identical for every cycle of the loop. Have to
>>>>>>> stop for the
>>>>>>> day.. more tomorrow.
>>>>>>> 
>>>>>>> Ben
>>>>>>> 
>>>>>> 
>>>>>> Duh??? It???s this patch: pNFS: Fix post-layoutget error handling in
>>>>>> pnfs_update_layout()
>>>>>> We have to pass through fatal errors??? I???ll fix it.
>>>>> 
>>>>> That's indeed fixed it up, and generic/207 passes now.  Thanks!
>>>> 
>>>> So I spoke too soon in my last mail, generic/207 still fails for me
>>>> with Trond's linux-next tree, although much later in the test now.
>>>> 
>>>> Does it include the changes that are supposed to fix the issue?
>>> 
>>> It should -- the v2 that fixed 207 for me is 56b38a1f7c781519eef09c1668a3c97ea911f86b, the first version was e35c2a0b3cd062a8941d21511719391b64437427, I think.  I'll test again too.
>> 
>> Looks like we're back to the original problem - it fails with the inode size is 4k less than expected.
>> 
>> The reason it worked for me was I had pnfs_ld debugging turned up which slowed things down enough to somehow catch the right size.
>> 
>> Looks like the right size is returned in the CLOSE, but the inode's not getting updated.
> 
> And the size is right in the last LAYOUTCOMMIT response, of course.  Is this the problem?
> 
> 6024 static int decode_layoutcommit(struct xdr_stream *xdr,
> ...
> 6040     sizechanged = be32_to_cpup(p);
> 6041
> 6042     if (sizechanged) {
> 6043         /* throw away new size */
> 6044         p = xdr_inline_decode(xdr, 8);
> 
> 
> Ben
> 

That shouldn’t really matter since we have a GETATTR immediately following the LAYOUTGET operation. Assuming that nfs4_proc_layoutcommit() actually gets called, it is supposed to update the inode correctly.

Cheers
  Trond

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 21, 2016, 1:05 p.m. UTC | #11

On 21 Jul 2016, at 8:46, Trond Myklebust wrote:

>> On Jul 21, 2016, at 05:52, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>>
>> On 21 Jul 2016, at 5:10, Benjamin Coddington wrote:
>>
>>> On 21 Jul 2016, at 4:32, Benjamin Coddington wrote:
>>>
>>>> On 21 Jul 2016, at 4:22, hch@infradead.org wrote:
>>>>
>>>>> On Wed, Jul 20, 2016 at 11:03:06AM -0400, Benjamin Coddington 
>>>>> wrote:
>>>>>>>> This debug output is identical for every cycle of the loop. 
>>>>>>>> Have to
>>>>>>>> stop for the
>>>>>>>> day.. more tomorrow.
>>>>>>>>
>>>>>>>> Ben
>>>>>>>>
>>>>>>>
>>>>>>> Duh??? It???s this patch: pNFS: Fix post-layoutget error 
>>>>>>> handling in
>>>>>>> pnfs_update_layout()
>>>>>>> We have to pass through fatal errors??? I???ll fix it.
>>>>>>
>>>>>> That's indeed fixed it up, and generic/207 passes now.  Thanks!
>>>>>
>>>>> So I spoke too soon in my last mail, generic/207 still failjs for 
>>>>> me
>>>>> with Trond's linux-next tree, although much later in the test now.
>>>>>
>>>>> Does it include the changes that are supposed to fix the issue?
>>>>
>>>> It should -- the v2 that fixed 207 for me is 
>>>> 56b38a1f7c781519eef09c1668a3c97ea911f86b, the first version was 
>>>> e35c2a0b3cd062a8941d21511719391b64437427, I think.  I'll test again 
>>>> too.
>>>
>>> Looks like we're back to the original problem - it fails with the 
>>> inode size is 4k less than expected.
>>>
>>> The reason it worked for me was I had pnfs_ld debugging turned up 
>>> which slowed things down enough to somehow catch the right size.
>>>
>>> Looks like the right size is returned in the CLOSE, but the inode's 
>>> not getting updated.
>>
>> And the size is right in the last LAYOUTCOMMIT response, of course.  
>> Is this the problem?
>>
>> 6024 static int decode_layoutcommit(struct xdr_stream *xdr,
>> ...
>> 6040     sizechanged = be32_to_cpup(p);
>> 6041
>> 6042     if (sizechanged) {
>> 6043         /* throw away new size */
>> 6044         p = xdr_inline_decode(xdr, 8);
>>
>>
>> Ben
>>
>
> That shouldn’t really matter since we have a GETATTR immediately 
> following the LAYOUTGET operation. Assuming that 
> nfs4_proc_layoutcommit() actually gets called, it is supposed to 
> update the inode correctly.

So back to Christoph's point earlier:

On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:
> This one breaks xfstests generic/207 on block/scsi layout for me.  The
> reason for that is that we need a layoutcommit after writing out all
> data for the file for the file size to be updated on the server.

You responded:

On 18 Jul 2016, at 0:32, Trond Myklebust wrote:
> I’m not understanding this argument. Why do we care if the file size 
> is up
> to date on the server if we’re not sending an actual GETATTR on the 
> wire
> to retrieve the file size?

I guess the answer might be because we can get it back from the last
LAYOUTCOMMIT.

This test has repeated appending 4k and has this pattern on the wire:

NFS 334 V4 Call LAYOUTGET
NFS 290 V4 Reply (Call In 854) LAYOUTCOMMIT
NFS 294 V4 Call GETATTR FH: 0x4f5528b0
NFS 442 V4 Reply (Call In 858) GETATTR
NFS 374 V4 Call LAYOUTCOMMIT
NFS 314 V4 Reply (Call In 856) LAYOUTGET
NFS 334 V4 Call LAYOUTGET
NFS 290 V4 Reply (Call In 860) LAYOUTCOMMIT
NFS 294 V4 Call GETATTR FH: 0x4f5528b0
NFS 442 V4 Reply (Call In 864) GETATTR
NFS 374 V4 Call LAYOUTCOMMIT
NFS 314 V4 Reply (Call In 862) LAYOUTGET
NFS 334 V4 Call LAYOUTGET
NFS 290 V4 Reply (Call In 866) LAYOUTCOMMIT
NFS 294 V4 Call GETATTR FH: 0x4f5528b0
NFS 442 V4 Reply (Call In 870) GETATTR
NFS 314 V4 Reply (Call In 868) LAYOUTGET
NFS 374 V4 Call LAYOUTCOMMIT
NFS 290 V4 Reply (Call In 874) LAYOUTCOMMIT
NFS 314 V4 Call CLOSE StateID: 0x54d9
NFS 294 V4 Reply (Call In 876) CLOSE

That last LAYOUTCOMMIT and the CLOSE have the size we want.  The 
previous
GETATTR is 4k short.

Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Trond Myklebust July 21, 2016, 1:20 p.m. UTC | #12

> On Jul 21, 2016, at 09:05, Benjamin Coddington <bcodding@redhat.com> wrote:

> 

> So back to Christoph's point earlier:

> 

> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:

>> This one breaks xfstests generic/207 on block/scsi layout for me.  The

>> reason for that is that we need a layoutcommit after writing out all

>> data for the file for the file size to be updated on the server.

> 

> You responded:

> 

> On 18 Jul 2016, at 0:32, Trond Myklebust wrote:

>> I’m not understanding this argument. Why do we care if the file size is up

>> to date on the server if we’re not sending an actual GETATTR on the wire

>> to retrieve the file size?

> 

> I guess the answer might be because we can get it back from the last

> LAYOUTCOMMIT.

> 

The patch that I followed up with should now ensure that we do not mark the attribute cache as up to date if there is a LAYOUTCOMMIT pending.
IOW: when the pNFS write is done, it is expected to do 2 things:

1) mark the inode for LAYOUTCOMMIT
2) mark the attribute cache as invalid (because we know the change attribute, mtime, ctime need to be updates)

In the case of blocks pNFS write:
The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should take care of (1)
The call to nfs_writeback_update_inode() in nfs4_write_done_cb() should take care of (2).

Provided that these 2 calls are performed in the above order, then any call to nfs_getattr() which has not been preceded by a call to nfs4_proc_layoutcommit() should trigger the call to __nfs_revalidate_inode().

> This test has repeated appending 4k and has this pattern on the wire:

> 

> NFS 334 V4 Call LAYOUTGET

> NFS 290 V4 Reply (Call In 854) LAYOUTCOMMIT

> NFS 294 V4 Call GETATTR FH: 0x4f5528b0

> NFS 442 V4 Reply (Call In 858) GETATTR

> NFS 374 V4 Call LAYOUTCOMMIT

> NFS 314 V4 Reply (Call In 856) LAYOUTGET

> NFS 334 V4 Call LAYOUTGET

> NFS 290 V4 Reply (Call In 860) LAYOUTCOMMIT

> NFS 294 V4 Call GETATTR FH: 0x4f5528b0

> NFS 442 V4 Reply (Call In 864) GETATTR

> NFS 374 V4 Call LAYOUTCOMMIT

> NFS 314 V4 Reply (Call In 862) LAYOUTGET

> NFS 334 V4 Call LAYOUTGET

> NFS 290 V4 Reply (Call In 866) LAYOUTCOMMIT

> NFS 294 V4 Call GETATTR FH: 0x4f5528b0

> NFS 442 V4 Reply (Call In 870) GETATTR

> NFS 314 V4 Reply (Call In 868) LAYOUTGET

> NFS 374 V4 Call LAYOUTCOMMIT

> NFS 290 V4 Reply (Call In 874) LAYOUTCOMMIT

> NFS 314 V4 Call CLOSE StateID: 0x54d9

> NFS 294 V4 Reply (Call In 876) CLOSE

> 

> That last LAYOUTCOMMIT and the CLOSE have the size we want.  The previous

> GETATTR is 4k short.

When you say “4k short”, do you mean that it differs from the value returned by the LAYOUTCOMMIT that immediately precedes it? It looks as if there is a LAYOUTGET immediately following it, so presumably the writes have not all completed yet.

Cheers
  Trond

Trond Myklebust July 21, 2016, 2 p.m. UTC | #13

> On Jul 21, 2016, at 09:20, Trond Myklebust <trondmy@primarydata.com> wrote:

> 

> 

>> On Jul 21, 2016, at 09:05, Benjamin Coddington <bcodding@redhat.com> wrote:

>> 

>> So back to Christoph's point earlier:

>> 

>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:

>>> This one breaks xfstests generic/207 on block/scsi layout for me.  The

>>> reason for that is that we need a layoutcommit after writing out all

>>> data for the file for the file size to be updated on the server.

>> 

>> You responded:

>> 

>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote:

>>> I’m not understanding this argument. Why do we care if the file size is up

>>> to date on the server if we’re not sending an actual GETATTR on the wire

>>> to retrieve the file size?

>> 

>> I guess the answer might be because we can get it back from the last

>> LAYOUTCOMMIT.

>> 

> 

> The patch that I followed up with should now ensure that we do not mark the attribute cache as up to date if there is a LAYOUTCOMMIT pending.

> IOW: when the pNFS write is done, it is expected to do 2 things:

> 

> 1) mark the inode for LAYOUTCOMMIT

> 2) mark the attribute cache as invalid (because we know the change attribute, mtime, ctime need to be updates)

> 

> In the case of blocks pNFS write:

> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should take care of (1)

> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() should take care of (2).

> 

> Provided that these 2 calls are performed in the above order, then any call to nfs_getattr() which has not been preceded by a call to nfs4_proc_layoutcommit() should trigger the call to __nfs_revalidate_inode().


By the way, it looks as if the ‘files’ layout type fails to do (2). I’ll add a fix for that.

Cheers
  Trond

Benjamin Coddington July 21, 2016, 2:02 p.m. UTC | #14

On 21 Jul 2016, at 9:20, Trond Myklebust wrote:

>> On Jul 21, 2016, at 09:05, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>>
>> So back to Christoph's point earlier:
>>
>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:
>>> This one breaks xfstests generic/207 on block/scsi layout for me.  
>>> The
>>> reason for that is that we need a layoutcommit after writing out all
>>> data for the file for the file size to be updated on the server.
>>
>> You responded:
>>
>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote:
>>> I’m not understanding this argument. Why do we care if the file 
>>> size is up
>>> to date on the server if we’re not sending an actual GETATTR on 
>>> the wire
>>> to retrieve the file size?
>>
>> I guess the answer might be because we can get it back from the last
>> LAYOUTCOMMIT.
>>
>
> The patch that I followed up with should now ensure that we do not 
> mark the attribute cache as up to date if there is a LAYOUTCOMMIT 
> pending.
> IOW: when the pNFS write is done, it is expected to do 2 things:
>
> 1) mark the inode for LAYOUTCOMMIT
> 2) mark the attribute cache as invalid (because we know the change 
> attribute, mtime, ctime need to be updates)
>
> In the case of blocks pNFS write:
> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should 
> take care of (1)
> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() 
> should take care of (2).
>
> Provided that these 2 calls are performed in the above order, then any 
> call to nfs_getattr() which has not been preceded by a call to 
> nfs4_proc_layoutcommit() should trigger the call to 
> __nfs_revalidate_inode().

OK, so maybe things are out of order here..  Thanks - this is helpful.

>> This test has repeated appending 4k and has this pattern on the wire:
>>
>> NFS 334 V4 Call LAYOUTGET
>> NFS 290 V4 Reply (Call In 854) LAYOUTCOMMIT
>> NFS 294 V4 Call GETATTR FH: 0x4f5528b0
>> NFS 442 V4 Reply (Call In 858) GETATTR
>> NFS 374 V4 Call LAYOUTCOMMIT
>> NFS 314 V4 Reply (Call In 856) LAYOUTGET
>> NFS 334 V4 Call LAYOUTGET
>> NFS 290 V4 Reply (Call In 860) LAYOUTCOMMIT
>> NFS 294 V4 Call GETATTR FH: 0x4f5528b0
>> NFS 442 V4 Reply (Call In 864) GETATTR
>> NFS 374 V4 Call LAYOUTCOMMIT
>> NFS 314 V4 Reply (Call In 862) LAYOUTGET
>> NFS 334 V4 Call LAYOUTGET
>> NFS 290 V4 Reply (Call In 866) LAYOUTCOMMIT
>> NFS 294 V4 Call GETATTR FH: 0x4f5528b0
>> NFS 442 V4 Reply (Call In 870) GETATTR
>> NFS 314 V4 Reply (Call In 868) LAYOUTGET
>> NFS 374 V4 Call LAYOUTCOMMIT
>> NFS 290 V4 Reply (Call In 874) LAYOUTCOMMIT
>> NFS 314 V4 Call CLOSE StateID: 0x54d9
>> NFS 294 V4 Reply (Call In 876) CLOSE
>>
>> That last LAYOUTCOMMIT and the CLOSE have the size we want.  The 
>> previous
>> GETATTR is 4k short.
>
> When you say “4k short”, do you mean that it differs from the 
> value returned by the LAYOUTCOMMIT that immediately precedes it? It 
> looks as if there is a LAYOUTGET immediately following it, so 
> presumably the writes have not all completed yet.

I meant it is 4k short of the final size returned in the LAYOUTCOMMIT
immediately following.  It is the same as the LAYOUTCOMMIT immediately
preceding.

FWIW at this point, here's a better look at the network:

tshark -r /tmp/pcap -T fields -e frame.number -e nfs.fh.hash -e 
nfs.opcode -e nfs.fattr4.size 'frame.number > 860'
861     53,22,50
862 0x4f5528b0  53,22,50
863     53,22,49,9  409600
864 0x4f5528b0  53,22,9
865     53,22,9 409600
866 0x4f5528b0  53,22,49,9
867     53,22,50
868 0x4f5528b0  53,22,50
869     53,22,49,9  413696
870 0x4f5528b0  53,22,9
871     53,22,9 413696
872     53,22,50
873
874 0x4f5528b0  53,22,49,9
875     53,22,49,9  417792
876 0x4f5528b0  53,22,4,9
877     53,22,4,9   417792
...
880 0x4f5528b0  53,22,9
881     53,22,9 417792

Now I see there's a GETATTR after the CLOSE that returns the right size 
-- but
I'll really should track down what's happening to that one to see if it 
is the
same call that the test is making.  Unfortunately, I'm getting pulled 
away
again, so I'll dig more later.  Thanks for looking.

Ben

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 25, 2016, 4:26 p.m. UTC | #15

On 21 Jul 2016, at 9:20, Trond Myklebust wrote:

>> On Jul 21, 2016, at 09:05, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>>
>> So back to Christoph's point earlier:
>>
>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:
>>> This one breaks xfstests generic/207 on block/scsi layout for me.  
>>> The
>>> reason for that is that we need a layoutcommit after writing out all
>>> data for the file for the file size to be updated on the server.
>>
>> You responded:
>>
>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote:
>>> I’m not understanding this argument. Why do we care if the file 
>>> size is up
>>> to date on the server if we’re not sending an actual GETATTR on 
>>> the wire
>>> to retrieve the file size?
>>
>> I guess the answer might be because we can get it back from the last
>> LAYOUTCOMMIT.
>>
>
> The patch that I followed up with should now ensure that we do not 
> mark the attribute cache as up to date if there is a LAYOUTCOMMIT 
> pending.
> IOW: when the pNFS write is done, it is expected to do 2 things:
>
> 1) mark the inode for LAYOUTCOMMIT
> 2) mark the attribute cache as invalid (because we know the change 
> attribute, mtime, ctime need to be updates)
>
> In the case of blocks pNFS write:
> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should 
> take care of (1)
> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() 
> should take care of (2).
>
> Provided that these 2 calls are performed in the above order, then any 
> call to nfs_getattr() which has not been preceded by a call to 
> nfs4_proc_layoutcommit() should trigger the call to 
> __nfs_revalidate_inode().

I think the problem is that a following nfs_getattr() will fail to 
notice
the size change in the case of a write_completion and layoutcommit 
occuring
after nfs_getattr() has done pnfs_sync_inode() but before it has done
nfs_update_inode().

In the failing case there are two threads one is doing writes, the other
doing lstat on aio_complete via io_getevents(2).

For each write completion the lstat thread tries to verify the file 
size.

GETATTR Thread                  LAYOUTCOMMIT Thread
--------------                  --------------------
                                 write_completion sets LAYOUTCOMMIT 
(4096@0)
--> nfs_getattr
  __nfs_revalidate_inode
   pnfs_sync_inode
   getattr sees 4096
                                 write_completion sets LAYOUTCOMMIT 
(4096@4096)
                                 sets LAYOUTCOMMITING
                                 clears LAYOUTCOMMIT
                                 clears LAYOUTCOMMITTING
   nfs_refresh_inode
    nfs_update_inode size is 4096
<-- nfs_getattr

At this point the cached attributes are seen as up to date, but
aio-dio-extend-stat program expects that second write_completion to 
reflect
in the file size.

Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Trond Myklebust July 25, 2016, 4:39 p.m. UTC | #16

DQo+IE9uIEp1bCAyNSwgMjAxNiwgYXQgMTI6MjYsIEJlbmphbWluIENvZGRpbmd0b24gPGJjb2Rk
aW5nQHJlZGhhdC5jb20+IHdyb3RlOg0KPiANCj4gT24gMjEgSnVsIDIwMTYsIGF0IDk6MjAsIFRy
b25kIE15a2xlYnVzdCB3cm90ZToNCj4gDQo+Pj4gT24gSnVsIDIxLCAyMDE2LCBhdCAwOTowNSwg
QmVuamFtaW4gQ29kZGluZ3RvbiA8YmNvZGRpbmdAcmVkaGF0LmNvbT4gd3JvdGU6DQo+Pj4gDQo+
Pj4gU28gYmFjayB0byBDaHJpc3RvcGgncyBwb2ludCBlYXJsaWVyOg0KPj4+IA0KPj4+IE9uIDE3
IEp1bCAyMDE2LCBhdCAyMzo0OCwgQ2hyaXN0b3BoIEhlbGx3aWcgd3JvdGU6DQo+Pj4+IFRoaXMg
b25lIGJyZWFrcyB4ZnN0ZXN0cyBnZW5lcmljLzIwNyBvbiBibG9jay9zY3NpIGxheW91dCBmb3Ig
bWUuICBUaGUNCj4+Pj4gcmVhc29uIGZvciB0aGF0IGlzIHRoYXQgd2UgbmVlZCBhIGxheW91dGNv
bW1pdCBhZnRlciB3cml0aW5nIG91dCBhbGwNCj4+Pj4gZGF0YSBmb3IgdGhlIGZpbGUgZm9yIHRo
ZSBmaWxlIHNpemUgdG8gYmUgdXBkYXRlZCBvbiB0aGUgc2VydmVyLg0KPj4+IA0KPj4+IFlvdSBy
ZXNwb25kZWQ6DQo+Pj4gDQo+Pj4gT24gMTggSnVsIDIwMTYsIGF0IDA6MzIsIFRyb25kIE15a2xl
YnVzdCB3cm90ZToNCj4+Pj4gSeKAmW0gbm90IHVuZGVyc3RhbmRpbmcgdGhpcyBhcmd1bWVudC4g
V2h5IGRvIHdlIGNhcmUgaWYgdGhlIGZpbGUgc2l6ZSBpcyB1cA0KPj4+PiB0byBkYXRlIG9uIHRo
ZSBzZXJ2ZXIgaWYgd2XigJlyZSBub3Qgc2VuZGluZyBhbiBhY3R1YWwgR0VUQVRUUiBvbiB0aGUg
d2lyZQ0KPj4+PiB0byByZXRyaWV2ZSB0aGUgZmlsZSBzaXplPw0KPj4+IA0KPj4+IEkgZ3Vlc3Mg
dGhlIGFuc3dlciBtaWdodCBiZSBiZWNhdXNlIHdlIGNhbiBnZXQgaXQgYmFjayBmcm9tIHRoZSBs
YXN0DQo+Pj4gTEFZT1VUQ09NTUlULg0KPj4+IA0KPj4gDQo+PiBUaGUgcGF0Y2ggdGhhdCBJIGZv
bGxvd2VkIHVwIHdpdGggc2hvdWxkIG5vdyBlbnN1cmUgdGhhdCB3ZSBkbyBub3QgbWFyayB0aGUg
YXR0cmlidXRlIGNhY2hlIGFzIHVwIHRvIGRhdGUgaWYgdGhlcmUgaXMgYSBMQVlPVVRDT01NSVQg
cGVuZGluZy4NCj4+IElPVzogd2hlbiB0aGUgcE5GUyB3cml0ZSBpcyBkb25lLCBpdCBpcyBleHBl
Y3RlZCB0byBkbyAyIHRoaW5nczoNCj4+IA0KPj4gMSkgbWFyayB0aGUgaW5vZGUgZm9yIExBWU9V
VENPTU1JVA0KPj4gMikgbWFyayB0aGUgYXR0cmlidXRlIGNhY2hlIGFzIGludmFsaWQgKGJlY2F1
c2Ugd2Uga25vdyB0aGUgY2hhbmdlIGF0dHJpYnV0ZSwgbXRpbWUsIGN0aW1lIG5lZWQgdG8gYmUg
dXBkYXRlcykNCj4+IA0KPj4gSW4gdGhlIGNhc2Ugb2YgYmxvY2tzIHBORlMgd3JpdGU6DQo+PiBU
aGUgY2FsbCB0byBwbmZzX3NldF9sYXlvdXRjb21taXQoKSBpbiBwbmZzX2xkX3dyaXRlX2RvbmUo
KSBzaG91bGQgdGFrZSBjYXJlIG9mICgxKQ0KPj4gVGhlIGNhbGwgdG8gbmZzX3dyaXRlYmFja191
cGRhdGVfaW5vZGUoKSBpbiBuZnM0X3dyaXRlX2RvbmVfY2IoKSBzaG91bGQgdGFrZSBjYXJlIG9m
ICgyKS4NCj4+IA0KPj4gUHJvdmlkZWQgdGhhdCB0aGVzZSAyIGNhbGxzIGFyZSBwZXJmb3JtZWQg
aW4gdGhlIGFib3ZlIG9yZGVyLCB0aGVuIGFueSBjYWxsIHRvIG5mc19nZXRhdHRyKCkgd2hpY2gg
aGFzIG5vdCBiZWVuIHByZWNlZGVkIGJ5IGEgY2FsbCB0byBuZnM0X3Byb2NfbGF5b3V0Y29tbWl0
KCkgc2hvdWxkIHRyaWdnZXIgdGhlIGNhbGwgdG8gX19uZnNfcmV2YWxpZGF0ZV9pbm9kZSgpLg0K
PiANCj4gSSB0aGluayB0aGUgcHJvYmxlbSBpcyB0aGF0IGEgZm9sbG93aW5nIG5mc19nZXRhdHRy
KCkgd2lsbCBmYWlsIHRvIG5vdGljZQ0KPiB0aGUgc2l6ZSBjaGFuZ2UgaW4gdGhlIGNhc2Ugb2Yg
YSB3cml0ZV9jb21wbGV0aW9uIGFuZCBsYXlvdXRjb21taXQgb2NjdXJpbmcNCj4gYWZ0ZXIgbmZz
X2dldGF0dHIoKSBoYXMgZG9uZSBwbmZzX3N5bmNfaW5vZGUoKSBidXQgYmVmb3JlIGl0IGhhcyBk
b25lDQo+IG5mc191cGRhdGVfaW5vZGUoKS4NCj4gDQo+IEluIHRoZSBmYWlsaW5nIGNhc2UgdGhl
cmUgYXJlIHR3byB0aHJlYWRzIG9uZSBpcyBkb2luZyB3cml0ZXMsIHRoZSBvdGhlcg0KPiBkb2lu
ZyBsc3RhdCBvbiBhaW9fY29tcGxldGUgdmlhIGlvX2dldGV2ZW50cygyKS4NCj4gDQo+IEZvciBl
YWNoIHdyaXRlIGNvbXBsZXRpb24gdGhlIGxzdGF0IHRocmVhZCB0cmllcyB0byB2ZXJpZnkgdGhl
IGZpbGUgc2l6ZS4NCj4gDQo+IEdFVEFUVFIgVGhyZWFkICAgICAgICAgICAgICAgICAgTEFZT1VU
Q09NTUlUIFRocmVhZA0KPiAtLS0tLS0tLS0tLS0tLSAgICAgICAgICAgICAgICAgIC0tLS0tLS0t
LS0tLS0tLS0tLS0tDQo+ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICB3cml0ZV9jb21w
bGV0aW9uIHNldHMgTEFZT1VUQ09NTUlUICg0MDk2QDApDQo+IC0tPiBuZnNfZ2V0YXR0cg0KDQpm
aWxlbWFwX3dyaXRlX2FuZF93YWl0KCkNCg0KPiBfX25mc19yZXZhbGlkYXRlX2lub2RlDQo+ICBw
bmZzX3N5bmNfaW5vZGUNCg0KTkZTX1BST1RPKGlub2RlKS0+Z2V0YXR0cigpDQoNCj4gIGdldGF0
dHIgc2VlcyA0MDk2DQo+ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICB3cml0ZV9jb21w
bGV0aW9uIHNldHMgTEFZT1VUQ09NTUlUICg0MDk2QDQwOTYpDQo+ICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICBzZXRzIExBWU9VVENPTU1JVElORw0KPiAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgY2xlYXJzIExBWU9VVENPTU1JVA0KPiAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgY2xlYXJzIExBWU9VVENPTU1JVFRJTkcNCj4gIG5mc19yZWZyZXNoX2lub2RlDQo+
ICAgbmZzX3VwZGF0ZV9pbm9kZSBzaXplIGlzIDQwOTYNCj4gPC0tIG5mc19nZXRhdHRyDQo+IA0K
PiBBdCB0aGlzIHBvaW50IHRoZSBjYWNoZWQgYXR0cmlidXRlcyBhcmUgc2VlbiBhcyB1cCB0byBk
YXRlLCBidXQNCj4gYWlvLWRpby1leHRlbmQtc3RhdCBwcm9ncmFtIGV4cGVjdHMgdGhhdCBzZWNv
bmQgd3JpdGVfY29tcGxldGlvbiB0byByZWZsZWN0DQo+IGluIHRoZSBmaWxlIHNpemUuDQo+IA0K
DQpXaHkgaXNu4oCZdCB0aGUgZmlsZW1hcF93cml0ZV9hbmRfd2FpdCgpIGFib3ZlIHJlc29sdmlu
ZyB0aGUgcmFjZT8gSeKAmWQgZXhwZWN0IHRoYXQgd291bGQgbW92ZSB5b3VyIOKAnHdyaXRlIGNv
bXBsZXRpb24gc2V0cyBMQVlPVVRDT01NSVTigJ0gdXAgdG8gYmVmb3JlIHRoZSBwbmZzX3N5bmNf
aW5vZGUoKS4NCkluIGZhY3QsIGluIHRoZSBwYXRjaCB0aGF0IENocmlzdG9waCBzZW50LCBhbGwg
aGUgd2FzIGRvaW5nIHdhcyBtb3ZpbmcgdGhlIHBuZnNfc3luY19pbm9kZSgpIHRvIGltbWVkaWF0
ZWx5IGFmdGVyIHRoYXQgZmlsZW1hcF93cml0ZV9hbmRfd2FpdCgpIGluc3RlYWQgb2YgcmVseWlu
ZyBvbiBpdCBpbiBfbmZzX3JldmFsaWRhdGVfaW5vZGUu

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 25, 2016, 6:26 p.m. UTC | #17

On 25 Jul 2016, at 12:39, Trond Myklebust wrote:

>> On Jul 25, 2016, at 12:26, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>>
>> On 21 Jul 2016, at 9:20, Trond Myklebust wrote:
>>
>>>> On Jul 21, 2016, at 09:05, Benjamin Coddington 
>>>> <bcodding@redhat.com> wrote:
>>>>
>>>> So back to Christoph's point earlier:
>>>>
>>>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:
>>>>> This one breaks xfstests generic/207 on block/scsi layout for me.  
>>>>> The
>>>>> reason for that is that we need a layoutcommit after writing out 
>>>>> all
>>>>> data for the file for the file size to be updated on the server.
>>>>
>>>> You responded:
>>>>
>>>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote:
>>>>> I’m not understanding this argument. Why do we care if the file 
>>>>> size is up
>>>>> to date on the server if we’re not sending an actual GETATTR on 
>>>>> the wire
>>>>> to retrieve the file size?
>>>>
>>>> I guess the answer might be because we can get it back from the 
>>>> last
>>>> LAYOUTCOMMIT.
>>>>
>>>
>>> The patch that I followed up with should now ensure that we do not 
>>> mark the attribute cache as up to date if there is a LAYOUTCOMMIT 
>>> pending.
>>> IOW: when the pNFS write is done, it is expected to do 2 things:
>>>
>>> 1) mark the inode for LAYOUTCOMMIT
>>> 2) mark the attribute cache as invalid (because we know the change 
>>> attribute, mtime, ctime need to be updates)
>>>
>>> In the case of blocks pNFS write:
>>> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should 
>>> take care of (1)
>>> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() 
>>> should take care of (2).
>>>
>>> Provided that these 2 calls are performed in the above order, then 
>>> any call to nfs_getattr() which has not been preceded by a call to 
>>> nfs4_proc_layoutcommit() should trigger the call to 
>>> __nfs_revalidate_inode().
>>
>> I think the problem is that a following nfs_getattr() will fail to 
>> notice
>> the size change in the case of a write_completion and layoutcommit 
>> occuring
>> after nfs_getattr() has done pnfs_sync_inode() but before it has done
>> nfs_update_inode().
>>
>> In the failing case there are two threads one is doing writes, the 
>> other
>> doing lstat on aio_complete via io_getevents(2).
>>
>> For each write completion the lstat thread tries to verify the file 
>> size.
>>
>> GETATTR Thread                  LAYOUTCOMMIT Thread
>> --------------                  --------------------
>>                                write_completion sets LAYOUTCOMMIT 
>> (4096@0)
>> --> nfs_getattr
>
> filemap_write_and_wait()
>
>> __nfs_revalidate_inode
>>  pnfs_sync_inode
>
> NFS_PROTO(inode)->getattr()
>
>>  getattr sees 4096
>>                                write_completion sets LAYOUTCOMMIT 
>> (4096@4096)
>>                                sets LAYOUTCOMMITING
>>                                clears LAYOUTCOMMIT
>>                                clears LAYOUTCOMMITTING
>>  nfs_refresh_inode
>>   nfs_update_inode size is 4096
>> <-- nfs_getattr
>>
>> At this point the cached attributes are seen as up to date, but
>> aio-dio-extend-stat program expects that second write_completion to 
>> reflect
>> in the file size.
>>
>
> Why isn’t the filemap_write_and_wait() above resolving the race? 
> I’d
> expect that would move your “write completion sets LAYOUTCOMMIT” 
> up to
> before the pnfs_sync_inode().  In fact, in the patch that Christoph 
> sent,
> all he was doing was moving the pnfs_sync_inode() to immediately after
> that filemap_write_and_wait() instead of relying on it in
> _nfs_revalidate_inode.

This is O_DIRECT, I've failed to mention yet.  The second write hasn't 
made
it out of __nfs_pageio_add_request() at the time 
filemap_write_and_wait() is
called.  It is sleeping in pnfs_update_layout() waiting on a LAYOUTGET 
and it
doesn't resumes until after filemap_write_and_wait().
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Trond Myklebust July 25, 2016, 6:34 p.m. UTC | #18

DQo+IE9uIEp1bCAyNSwgMjAxNiwgYXQgMTQ6MjYsIEJlbmphbWluIENvZGRpbmd0b24gPGJjb2Rk
aW5nQHJlZGhhdC5jb20+IHdyb3RlOg0KPiANCj4gDQo+IA0KPiBPbiAyNSBKdWwgMjAxNiwgYXQg
MTI6MzksIFRyb25kIE15a2xlYnVzdCB3cm90ZToNCj4gDQo+Pj4gT24gSnVsIDI1LCAyMDE2LCBh
dCAxMjoyNiwgQmVuamFtaW4gQ29kZGluZ3RvbiA8YmNvZGRpbmdAcmVkaGF0LmNvbT4gd3JvdGU6
DQo+Pj4gDQo+Pj4gT24gMjEgSnVsIDIwMTYsIGF0IDk6MjAsIFRyb25kIE15a2xlYnVzdCB3cm90
ZToNCj4+PiANCj4+Pj4+IE9uIEp1bCAyMSwgMjAxNiwgYXQgMDk6MDUsIEJlbmphbWluIENvZGRp
bmd0b24gPGJjb2RkaW5nQHJlZGhhdC5jb20+IHdyb3RlOg0KPj4+Pj4gDQo+Pj4+PiBTbyBiYWNr
IHRvIENocmlzdG9waCdzIHBvaW50IGVhcmxpZXI6DQo+Pj4+PiANCj4+Pj4+IE9uIDE3IEp1bCAy
MDE2LCBhdCAyMzo0OCwgQ2hyaXN0b3BoIEhlbGx3aWcgd3JvdGU6DQo+Pj4+Pj4gVGhpcyBvbmUg
YnJlYWtzIHhmc3Rlc3RzIGdlbmVyaWMvMjA3IG9uIGJsb2NrL3Njc2kgbGF5b3V0IGZvciBtZS4g
IFRoZQ0KPj4+Pj4+IHJlYXNvbiBmb3IgdGhhdCBpcyB0aGF0IHdlIG5lZWQgYSBsYXlvdXRjb21t
aXQgYWZ0ZXIgd3JpdGluZyBvdXQgYWxsDQo+Pj4+Pj4gZGF0YSBmb3IgdGhlIGZpbGUgZm9yIHRo
ZSBmaWxlIHNpemUgdG8gYmUgdXBkYXRlZCBvbiB0aGUgc2VydmVyLg0KPj4+Pj4gDQo+Pj4+PiBZ
b3UgcmVzcG9uZGVkOg0KPj4+Pj4gDQo+Pj4+PiBPbiAxOCBKdWwgMjAxNiwgYXQgMDozMiwgVHJv
bmQgTXlrbGVidXN0IHdyb3RlOg0KPj4+Pj4+IEnigJltIG5vdCB1bmRlcnN0YW5kaW5nIHRoaXMg
YXJndW1lbnQuIFdoeSBkbyB3ZSBjYXJlIGlmIHRoZSBmaWxlIHNpemUgaXMgdXANCj4+Pj4+PiB0
byBkYXRlIG9uIHRoZSBzZXJ2ZXIgaWYgd2XigJlyZSBub3Qgc2VuZGluZyBhbiBhY3R1YWwgR0VU
QVRUUiBvbiB0aGUgd2lyZQ0KPj4+Pj4+IHRvIHJldHJpZXZlIHRoZSBmaWxlIHNpemU/DQo+Pj4+
PiANCj4+Pj4+IEkgZ3Vlc3MgdGhlIGFuc3dlciBtaWdodCBiZSBiZWNhdXNlIHdlIGNhbiBnZXQg
aXQgYmFjayBmcm9tIHRoZSBsYXN0DQo+Pj4+PiBMQVlPVVRDT01NSVQuDQo+Pj4+PiANCj4+Pj4g
DQo+Pj4+IFRoZSBwYXRjaCB0aGF0IEkgZm9sbG93ZWQgdXAgd2l0aCBzaG91bGQgbm93IGVuc3Vy
ZSB0aGF0IHdlIGRvIG5vdCBtYXJrIHRoZSBhdHRyaWJ1dGUgY2FjaGUgYXMgdXAgdG8gZGF0ZSBp
ZiB0aGVyZSBpcyBhIExBWU9VVENPTU1JVCBwZW5kaW5nLg0KPj4+PiBJT1c6IHdoZW4gdGhlIHBO
RlMgd3JpdGUgaXMgZG9uZSwgaXQgaXMgZXhwZWN0ZWQgdG8gZG8gMiB0aGluZ3M6DQo+Pj4+IA0K
Pj4+PiAxKSBtYXJrIHRoZSBpbm9kZSBmb3IgTEFZT1VUQ09NTUlUDQo+Pj4+IDIpIG1hcmsgdGhl
IGF0dHJpYnV0ZSBjYWNoZSBhcyBpbnZhbGlkIChiZWNhdXNlIHdlIGtub3cgdGhlIGNoYW5nZSBh
dHRyaWJ1dGUsIG10aW1lLCBjdGltZSBuZWVkIHRvIGJlIHVwZGF0ZXMpDQo+Pj4+IA0KPj4+PiBJ
biB0aGUgY2FzZSBvZiBibG9ja3MgcE5GUyB3cml0ZToNCj4+Pj4gVGhlIGNhbGwgdG8gcG5mc19z
ZXRfbGF5b3V0Y29tbWl0KCkgaW4gcG5mc19sZF93cml0ZV9kb25lKCkgc2hvdWxkIHRha2UgY2Fy
ZSBvZiAoMSkNCj4+Pj4gVGhlIGNhbGwgdG8gbmZzX3dyaXRlYmFja191cGRhdGVfaW5vZGUoKSBp
biBuZnM0X3dyaXRlX2RvbmVfY2IoKSBzaG91bGQgdGFrZSBjYXJlIG9mICgyKS4NCj4+Pj4gDQo+
Pj4+IFByb3ZpZGVkIHRoYXQgdGhlc2UgMiBjYWxscyBhcmUgcGVyZm9ybWVkIGluIHRoZSBhYm92
ZSBvcmRlciwgdGhlbiBhbnkgY2FsbCB0byBuZnNfZ2V0YXR0cigpIHdoaWNoIGhhcyBub3QgYmVl
biBwcmVjZWRlZCBieSBhIGNhbGwgdG8gbmZzNF9wcm9jX2xheW91dGNvbW1pdCgpIHNob3VsZCB0
cmlnZ2VyIHRoZSBjYWxsIHRvIF9fbmZzX3JldmFsaWRhdGVfaW5vZGUoKS4NCj4+PiANCj4+PiBJ
IHRoaW5rIHRoZSBwcm9ibGVtIGlzIHRoYXQgYSBmb2xsb3dpbmcgbmZzX2dldGF0dHIoKSB3aWxs
IGZhaWwgdG8gbm90aWNlDQo+Pj4gdGhlIHNpemUgY2hhbmdlIGluIHRoZSBjYXNlIG9mIGEgd3Jp
dGVfY29tcGxldGlvbiBhbmQgbGF5b3V0Y29tbWl0IG9jY3VyaW5nDQo+Pj4gYWZ0ZXIgbmZzX2dl
dGF0dHIoKSBoYXMgZG9uZSBwbmZzX3N5bmNfaW5vZGUoKSBidXQgYmVmb3JlIGl0IGhhcyBkb25l
DQo+Pj4gbmZzX3VwZGF0ZV9pbm9kZSgpLg0KPj4+IA0KPj4+IEluIHRoZSBmYWlsaW5nIGNhc2Ug
dGhlcmUgYXJlIHR3byB0aHJlYWRzIG9uZSBpcyBkb2luZyB3cml0ZXMsIHRoZSBvdGhlcg0KPj4+
IGRvaW5nIGxzdGF0IG9uIGFpb19jb21wbGV0ZSB2aWEgaW9fZ2V0ZXZlbnRzKDIpLg0KPj4+IA0K
Pj4+IEZvciBlYWNoIHdyaXRlIGNvbXBsZXRpb24gdGhlIGxzdGF0IHRocmVhZCB0cmllcyB0byB2
ZXJpZnkgdGhlIGZpbGUgc2l6ZS4NCj4+PiANCj4+PiBHRVRBVFRSIFRocmVhZCAgICAgICAgICAg
ICAgICAgIExBWU9VVENPTU1JVCBUaHJlYWQNCj4+PiAtLS0tLS0tLS0tLS0tLSAgICAgICAgICAg
ICAgICAgIC0tLS0tLS0tLS0tLS0tLS0tLS0tDQo+Pj4gICAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgd3JpdGVfY29tcGxldGlvbiBzZXRzIExBWU9VVENPTU1JVCAoNDA5NkAwKQ0KPj4+IC0t
PiBuZnNfZ2V0YXR0cg0KPj4gDQo+PiBmaWxlbWFwX3dyaXRlX2FuZF93YWl0KCkNCj4+IA0KPj4+
IF9fbmZzX3JldmFsaWRhdGVfaW5vZGUNCj4+PiBwbmZzX3N5bmNfaW5vZGUNCj4+IA0KPj4gTkZT
X1BST1RPKGlub2RlKS0+Z2V0YXR0cigpDQo+PiANCj4+PiBnZXRhdHRyIHNlZXMgNDA5Ng0KPj4+
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHdyaXRlX2NvbXBsZXRpb24gc2V0cyBMQVlP
VVRDT01NSVQgKDQwOTZANDA5NikNCj4+PiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBz
ZXRzIExBWU9VVENPTU1JVElORw0KPj4+ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGNs
ZWFycyBMQVlPVVRDT01NSVQNCj4+PiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBjbGVh
cnMgTEFZT1VUQ09NTUlUVElORw0KPj4+IG5mc19yZWZyZXNoX2lub2RlDQo+Pj4gIG5mc191cGRh
dGVfaW5vZGUgc2l6ZSBpcyA0MDk2DQo+Pj4gPC0tIG5mc19nZXRhdHRyDQo+Pj4gDQo+Pj4gQXQg
dGhpcyBwb2ludCB0aGUgY2FjaGVkIGF0dHJpYnV0ZXMgYXJlIHNlZW4gYXMgdXAgdG8gZGF0ZSwg
YnV0DQo+Pj4gYWlvLWRpby1leHRlbmQtc3RhdCBwcm9ncmFtIGV4cGVjdHMgdGhhdCBzZWNvbmQg
d3JpdGVfY29tcGxldGlvbiB0byByZWZsZWN0DQo+Pj4gaW4gdGhlIGZpbGUgc2l6ZS4NCj4+PiAN
Cj4+IA0KPj4gV2h5IGlzbuKAmXQgdGhlIGZpbGVtYXBfd3JpdGVfYW5kX3dhaXQoKSBhYm92ZSBy
ZXNvbHZpbmcgdGhlIHJhY2U/IEnigJlkDQo+PiBleHBlY3QgdGhhdCB3b3VsZCBtb3ZlIHlvdXIg
4oCcd3JpdGUgY29tcGxldGlvbiBzZXRzIExBWU9VVENPTU1JVOKAnSB1cCB0bw0KPj4gYmVmb3Jl
IHRoZSBwbmZzX3N5bmNfaW5vZGUoKS4gIEluIGZhY3QsIGluIHRoZSBwYXRjaCB0aGF0IENocmlz
dG9waCBzZW50LA0KPj4gYWxsIGhlIHdhcyBkb2luZyB3YXMgbW92aW5nIHRoZSBwbmZzX3N5bmNf
aW5vZGUoKSB0byBpbW1lZGlhdGVseSBhZnRlcg0KPj4gdGhhdCBmaWxlbWFwX3dyaXRlX2FuZF93
YWl0KCkgaW5zdGVhZCBvZiByZWx5aW5nIG9uIGl0IGluDQo+PiBfbmZzX3JldmFsaWRhdGVfaW5v
ZGUuDQo+IA0KPiBUaGlzIGlzIE9fRElSRUNULCBJJ3ZlIGZhaWxlZCB0byBtZW50aW9uIHlldC4g
IFRoZSBzZWNvbmQgd3JpdGUgaGFzbid0IG1hZGUNCj4gaXQgb3V0IG9mIF9fbmZzX3BhZ2Vpb19h
ZGRfcmVxdWVzdCgpIGF0IHRoZSB0aW1lIGZpbGVtYXBfd3JpdGVfYW5kX3dhaXQoKSBpcw0KPiBj
YWxsZWQuICBJdCBpcyBzbGVlcGluZyBpbiBwbmZzX3VwZGF0ZV9sYXlvdXQoKSB3YWl0aW5nIG9u
IGEgTEFZT1VUR0VUIGFuZCBpdA0KPiBkb2Vzbid0IHJlc3VtZXMgdW50aWwgYWZ0ZXIgZmlsZW1h
cF93cml0ZV9hbmRfd2FpdCgpLg0KDQpXYWl0LCBzbyB5b3UgaGF2ZSAxIHRocmVhZCBkb2luZyBh
biBPX0RJUkVDVCB3cml0ZSgpIGFuZCBhbm90aGVyIGRvaW5nIGEgc3RhdCgpIGluIHBhcmFsbGVs
PyBXaHkgd291bGQgdGhlcmUgYmUgYW4gZXhwZWN0YXRpb24gdGhhdCB0aGUgZmlsZXN5c3RlbSBz
aG91bGQgc2VyaWFsaXNlIHRob3NlIHN5c3RlbSBjYWxscz8NCg0K

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Benjamin Coddington July 25, 2016, 6:41 p.m. UTC | #19

On 25 Jul 2016, at 14:34, Trond Myklebust wrote:

>> On Jul 25, 2016, at 14:26, Benjamin Coddington <bcodding@redhat.com> 
>> wrote:
>>
>>
>>
>> On 25 Jul 2016, at 12:39, Trond Myklebust wrote:
>>
>>>> On Jul 25, 2016, at 12:26, Benjamin Coddington 
>>>> <bcodding@redhat.com> wrote:
>>>>
>>>> On 21 Jul 2016, at 9:20, Trond Myklebust wrote:
>>>>
>>>>>> On Jul 21, 2016, at 09:05, Benjamin Coddington 
>>>>>> <bcodding@redhat.com> wrote:
>>>>>>
>>>>>> So back to Christoph's point earlier:
>>>>>>
>>>>>> On 17 Jul 2016, at 23:48, Christoph Hellwig wrote:
>>>>>>> This one breaks xfstests generic/207 on block/scsi layout for 
>>>>>>> me.  The
>>>>>>> reason for that is that we need a layoutcommit after writing out 
>>>>>>> all
>>>>>>> data for the file for the file size to be updated on the server.
>>>>>>
>>>>>> You responded:
>>>>>>
>>>>>> On 18 Jul 2016, at 0:32, Trond Myklebust wrote:
>>>>>>> I’m not understanding this argument. Why do we care if the 
>>>>>>> file size is up
>>>>>>> to date on the server if we’re not sending an actual GETATTR 
>>>>>>> on the wire
>>>>>>> to retrieve the file size?
>>>>>>
>>>>>> I guess the answer might be because we can get it back from the 
>>>>>> last
>>>>>> LAYOUTCOMMIT.
>>>>>>
>>>>>
>>>>> The patch that I followed up with should now ensure that we do not 
>>>>> mark the attribute cache as up to date if there is a LAYOUTCOMMIT 
>>>>> pending.
>>>>> IOW: when the pNFS write is done, it is expected to do 2 things:
>>>>>
>>>>> 1) mark the inode for LAYOUTCOMMIT
>>>>> 2) mark the attribute cache as invalid (because we know the change 
>>>>> attribute, mtime, ctime need to be updates)
>>>>>
>>>>> In the case of blocks pNFS write:
>>>>> The call to pnfs_set_layoutcommit() in pnfs_ld_write_done() should 
>>>>> take care of (1)
>>>>> The call to nfs_writeback_update_inode() in nfs4_write_done_cb() 
>>>>> should take care of (2).
>>>>>
>>>>> Provided that these 2 calls are performed in the above order, then 
>>>>> any call to nfs_getattr() which has not been preceded by a call to 
>>>>> nfs4_proc_layoutcommit() should trigger the call to 
>>>>> __nfs_revalidate_inode().
>>>>
>>>> I think the problem is that a following nfs_getattr() will fail to 
>>>> notice
>>>> the size change in the case of a write_completion and layoutcommit 
>>>> occuring
>>>> after nfs_getattr() has done pnfs_sync_inode() but before it has 
>>>> done
>>>> nfs_update_inode().
>>>>
>>>> In the failing case there are two threads one is doing writes, the 
>>>> other
>>>> doing lstat on aio_complete via io_getevents(2).
>>>>
>>>> For each write completion the lstat thread tries to verify the file 
>>>> size.
>>>>
>>>> GETATTR Thread                  LAYOUTCOMMIT Thread
>>>> --------------                  --------------------
>>>>                               write_completion sets LAYOUTCOMMIT 
>>>> (4096@0)
>>>> --> nfs_getattr
>>>
>>> filemap_write_and_wait()
>>>
>>>> __nfs_revalidate_inode
>>>> pnfs_sync_inode
>>>
>>> NFS_PROTO(inode)->getattr()
>>>
>>>> getattr sees 4096
>>>>                               write_completion sets LAYOUTCOMMIT 
>>>> (4096@4096)
>>>>                               sets LAYOUTCOMMITING
>>>>                               clears LAYOUTCOMMIT
>>>>                               clears LAYOUTCOMMITTING
>>>> nfs_refresh_inode
>>>>  nfs_update_inode size is 4096
>>>> <-- nfs_getattr
>>>>
>>>> At this point the cached attributes are seen as up to date, but
>>>> aio-dio-extend-stat program expects that second write_completion to 
>>>> reflect
>>>> in the file size.
>>>>
>>>
>>> Why isn’t the filemap_write_and_wait() above resolving the race? 
>>> I’d
>>> expect that would move your “write completion sets LAYOUTCOMMIT” 
>>> up to
>>> before the pnfs_sync_inode().  In fact, in the patch that Christoph 
>>> sent,
>>> all he was doing was moving the pnfs_sync_inode() to immediately 
>>> after
>>> that filemap_write_and_wait() instead of relying on it in
>>> _nfs_revalidate_inode.
>>
>> This is O_DIRECT, I've failed to mention yet.  The second write 
>> hasn't made
>> it out of __nfs_pageio_add_request() at the time 
>> filemap_write_and_wait() is
>> called.  It is sleeping in pnfs_update_layout() waiting on a 
>> LAYOUTGET and it
>> doesn't resumes until after filemap_write_and_wait().
>
> Wait, so you have 1 thread doing an O_DIRECT write() and another doing 
> a
> stat() in parallel? Why would there be an expectation that the 
> filesystem
> should serialise those system calls?

Not exactly parallel, but synchronized on aio_complete.  A comment in
generic/207's src/aio-dio-regress/aio-dio-extend-stat.c:

  36 /*
  37  * This was originally submitted to
  38  * http://bugzilla.kernel.org/show_bug.cgi?id=6831 by
  39  * Rafal Wijata <wijata@nec-labs.com>.  It caught a race in dio aio 
completion
  40  * that would call aio_complete() before the dio callers would 
update i_size.
  41  * A stat after io_getevents() would not see the new file size.
  42  *
  43  * The bug was fixed in the fs/direct-io.c completion reworking 
that appeared
  44  * in 2.6.20.  This test should fail on 2.6.19.
  45  */

As far as I can see, this check is the whole point of generic/207..
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[v4,24/28] NFS: Getattr doesn't require data sync semantics

Commit Message

Comments

Patch