NFS/d_splice_alias breakage

Message ID	20160603042648.GN14480@ZenIV.linux.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Fri, 3 Jun 2016 05:26:48 +0100 From: Al Viro <viro@ZenIV.linux.org.uk> To: Oleg Drokin <green@linuxhacker.ru> Cc: "J. Bruce Fields" <bfields@redhat.com>, linux-nfs@vger.kernel.org, "<linux-kernel@vger.kernel.org> Mailing List" <linux-kernel@vger.kernel.org>, "<linux-fsdevel@vger.kernel.org>" <linux-fsdevel@vger.kernel.org> Subject: Re: NFS/d_splice_alias breakage Message-ID: <20160603042648.GN14480@ZenIV.linux.org.uk> References: <CF78E59B-F311-437B-9D1C-1F3FF6C22126@linuxhacker.ru> <20160603033750.GL14480@ZenIV.linux.org.uk> <0C971585-6BFC-4665-832B-9B262F733BFC@linuxhacker.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <0C971585-6BFC-4665-832B-9B262F733BFC@linuxhacker.ru> User-Agent: Mutt/1.6.0 (2016-04-01) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Al Viro June 3, 2016, 4:26 a.m. UTC

On Thu, Jun 02, 2016 at 11:43:59PM -0400, Oleg Drokin wrote:

> > Which of the call sites had that been and how does one reproduce that fun?
> > If you feel that posting a reproducer in the open is a bad idea, just send
> > it off-list...
> 
> This is fs/nfs/dir.c::nfs_lookup() right after no_entry label.

Bloody hell...  Was that sucker hashed on the entry into nfs_lookup()?
If we get that with call as a method, we are really fucked.

<greps>

Ho-hum...  One of the goto no_open; in nfs_atomic_open()?  That would
mean a stale negative dentry in dcache that has escaped revalidation...
Wait a minute, didn't nfs ->d_revalidate() skip everything in some
cases, leaving it to ->atomic_open()?

Looks like the right thing to do would be to do d_drop() at no_open:,
just before we call nfs_lookup().  If not earlier, actually...  How
about the following?

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 4:42 a.m. UTC | #1

On Fri, Jun 03, 2016 at 05:26:48AM +0100, Al Viro wrote:

> Looks like the right thing to do would be to do d_drop() at no_open:,
> just before we call nfs_lookup().  If not earlier, actually...  How
> about the following?

A bit of rationale: dentry in question is negative and attempt to open
it has just failed; in case it's really negative we did that d_drop()
anyway (and then unconditionally rehashed it).  In case when we proceed to
nfs_lookup() and it does not fail, we'll have it rehashed there (with the
right inode).  What do we lose from doing d_drop() on *any* error here?
It's negative, with dubious validity.  In the worst case, we had open
and lookup fail, but it's just us - the sucker really is negative and
somebody else would be able to revalidate it.  If we drop it here (and
not rehash later), that somebody else will have to allocate an in-core
dentry before doing lookup or atomic_open.  Which is negligible extra
burden.

> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index aaf7bd0..6e3a6f4 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1536,9 +1536,9 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
>  		err = PTR_ERR(inode);
>  		trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
>  		put_nfs_open_context(ctx);
> +		d_drop(dentry);
>  		switch (err) {
>  		case -ENOENT:
> -			d_drop(dentry);
>  			d_add(dentry, NULL);
>  			nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
>  			break;
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 4:53 a.m. UTC | #2

On Fri, Jun 03, 2016 at 05:42:53AM +0100, Al Viro wrote:
> On Fri, Jun 03, 2016 at 05:26:48AM +0100, Al Viro wrote:
> 
> > Looks like the right thing to do would be to do d_drop() at no_open:,
> > just before we call nfs_lookup().  If not earlier, actually...  How
> > about the following?
> 
> A bit of rationale: dentry in question is negative and attempt to open
> it has just failed; in case it's really negative we did that d_drop()
> anyway (and then unconditionally rehashed it).  In case when we proceed to
> nfs_lookup() and it does not fail, we'll have it rehashed there (with the
> right inode).  What do we lose from doing d_drop() on *any* error here?
> It's negative, with dubious validity.  In the worst case, we had open
> and lookup fail, but it's just us - the sucker really is negative and
> somebody else would be able to revalidate it.  If we drop it here (and
> not rehash later), that somebody else will have to allocate an in-core
> dentry before doing lookup or atomic_open.  Which is negligible extra
> burden.

I suspect that it got broken in commit 275bb3078 (NFSv4: Move dentry
instantiation into the NFSv4-specific atomic open code).  Prior to that
commit, d_drop() was there (error or no error).  Looks like 3.10+, indeed.
I agree that we shouldn't drop it on successful open, but it needs to be
done on errors.  All of them, not just ENOENT, as in that commit.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Oleg Drokin June 3, 2016, 4:58 a.m. UTC | #3

On Jun 3, 2016, at 12:26 AM, Al Viro wrote:

> On Thu, Jun 02, 2016 at 11:43:59PM -0400, Oleg Drokin wrote:
> 
>>> Which of the call sites had that been and how does one reproduce that fun?
>>> If you feel that posting a reproducer in the open is a bad idea, just send
>>> it off-list...
>> 
>> This is fs/nfs/dir.c::nfs_lookup() right after no_entry label.
> 
> Bloody hell...  Was that sucker hashed on the entry into nfs_lookup()?
> If we get that with call as a method, we are really fucked.
> 
> <greps>
> 
> Ho-hum...  One of the goto no_open; in nfs_atomic_open()?  That would
> mean a stale negative dentry in dcache that has escaped revalidation...
> Wait a minute, didn't nfs ->d_revalidate() skip everything in some
> cases, leaving it to ->atomic_open()?
> 
> Looks like the right thing to do would be to do d_drop() at no_open:,
> just before we call nfs_lookup().  If not earlier, actually...  How
> about the following?

This one cures the insta-crash I was having, and I see no other ill-effects so far.

> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index aaf7bd0..6e3a6f4 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1536,9 +1536,9 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
> 		err = PTR_ERR(inode);
> 		trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
> 		put_nfs_open_context(ctx);
> +		d_drop(dentry);
> 		switch (err) {
> 		case -ENOENT:
> -			d_drop(dentry);
> 			d_add(dentry, NULL);
> 			nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
> 			break;

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Oleg Drokin June 3, 2016, 4:38 p.m. UTC | #4

I am dropping NFS people since it seems to be converting into a generic VFS/dcache bug even though you need NFS or the like to trigger it - the lookup_open path.

On Jun 3, 2016, at 12:26 AM, Al Viro wrote:

> Looks like the right thing to do would be to do d_drop() at no_open:,
> just before we call nfs_lookup().  If not earlier, actually...  How
> about the following?

So with the below patch applied I get the same crash as with the patch from me
that allowed unhashed dentries to come through.
Same workload, but this time it takes much longer to trigger, so must be some
race.

[ 2642.364383] BUG: unable to handle kernel paging request at ffff880113f82000
[ 2642.365014] IP: [<ffffffff817f87d4>] bad_gs+0xd1d/0x1ba9
[ 2642.365590] PGD 2dca067 PUD 2dcd067 PMD 11ff61067 PTE 8000000113f82060
[ 2642.366200] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2642.366719] Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm nfsd drm_kms_helper ttm drm virtio_blk serio_raw
[ 2642.367852] CPU: 2 PID: 18446 Comm: cat Not tainted 4.7.0-rc1-vm-nfs+ #103
[ 2642.368372] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2642.368873] task: ffff8800d6956280 ti: ffff8800d7274000 task.ti: ffff8800d7274000
[ 2642.369810] RIP: 0010:[<ffffffff817f87d4>]  [<ffffffff817f87d4>] bad_gs+0xd1d/0x1ba9
[ 2642.370750] RSP: 0018:ffff8800d7277ba0  EFLAGS: 00010286
[ 2642.371239] RAX: ffff8800c3a6ff30 RBX: ffff8800c3a6ff00 RCX: ffff880113f82000
[ 2642.371765] RDX: ffff880113f82000 RSI: 0000000000000000 RDI: 0000000000000002
[ 2642.372286] RBP: ffff8800d7277be8 R08: 0000000000000000 R09: ffff8800c3a6fed0
[ 2642.372818] R10: 0000000000000059 R11: ffff8800d6956dd0 R12: ffff880111567ed0
[ 2642.373415] R13: ffff8800d7277df0 R14: ffff8800c3a6ff50 R15: 0000000084832a57
[ 2642.373940] FS:  00007fa1814a4700(0000) GS:ffff88011f480000(0000) knlGS:0000000000000000
[ 2642.374877] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2642.375378] CR2: ffff880113f82000 CR3: 00000000d605c000 CR4: 00000000000006e0
[ 2642.375881] Stack:
[ 2642.376295]  ffffffff81243a55 0000000000000096 ffff880113f82000 ffffffff00000002
[ 2642.377204]  000000000000bc04 ffff8800d7277df0 ffff880111567ed0 ffff8800d7277de0
[ 2642.378113]  ffff8800d7277df0 ffff8800d7277c18 ffffffff81243c84 ffffffff81236c4e
[ 2642.379022] Call Trace:
[ 2642.379451]  [<ffffffff81243a55>] ? __d_lookup+0x5/0x1b0
[ 2642.379920]  [<ffffffff81243c84>] d_lookup+0x84/0xb0
[ 2642.380388]  [<ffffffff81236c4e>] ? lookup_open+0xfe/0x7a0
[ 2642.380862]  [<ffffffff81236c4e>] lookup_open+0xfe/0x7a0
[ 2642.381374]  [<ffffffff81237c3f>] path_openat+0x94f/0xfc0
[ 2642.381852]  [<ffffffff8123935e>] do_filp_open+0x7e/0xe0
[ 2642.382182]  [<ffffffff81233110>] ? lock_rename+0x100/0x100
[ 2642.382747]  [<ffffffff817f4947>] ? _raw_spin_unlock+0x27/0x40
[ 2642.383324]  [<ffffffff8124877c>] ? __alloc_fd+0xbc/0x170
[ 2642.383864]  [<ffffffff81226896>] do_sys_open+0x116/0x1f0
[ 2642.384230]  [<ffffffff8122698e>] SyS_open+0x1e/0x20
[ 2642.384569]  [<ffffffff817f5136>] entry_SYSCALL_64_fastpath+0x1e/0xad
[ 2642.398718] Code: e1 03 49 d3 e8 e9 61 ae a4 ff 48 8d 0a 48 83 e1 f8 4c 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 20 af a4 ff 48 8d 0a 48 83 e1 f8 <4c> 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 9b b3 a4 ff b9 f2 
[ 2642.400892] RIP  [<ffffffff817f87d4>] bad_gs+0xd1d/0x1ba9
[ 2642.401417]  RSP <ffff8800d7277ba0>
[ 2642.401856] CR2: ffff880113f82000

Hm, somehow crashdumping support is broken for the newish kernels on my test box, I guess
I'll try to fix it and then re-reproduce to better understand what's going on here,
this trace is all I have for now in case anybody has any immediate ideas.


> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index aaf7bd0..6e3a6f4 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1536,9 +1536,9 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
> 		err = PTR_ERR(inode);
> 		trace_nfs_atomic_open_exit(dir, ctx, open_flags, err);
> 		put_nfs_open_context(ctx);
> +		d_drop(dentry);
> 		switch (err) {
> 		case -ENOENT:
> -			d_drop(dentry);
> 			d_add(dentry, NULL);
> 			nfs_set_verifier(dentry, nfs_save_change_attribute(dir));
> 			break;

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 6:22 p.m. UTC | #5

On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote:
> I am dropping NFS people since it seems to be converting into a generic VFS/dcache bug even though you need NFS or the like to trigger it - the lookup_open path.

NFS bug is real; there might very well be something else, but that d_drop()
in nfs_atomic_open() needs to be restored.

> [ 2642.364383] BUG: unable to handle kernel paging request at ffff880113f82000
> [ 2642.365014] IP: [<ffffffff817f87d4>] bad_gs+0xd1d/0x1ba9

*ow*
Could you dump your vmlinux (and System.map) somewhere on anonftp?
This 'bad_gs' is there simply because it's one of the few labels in
.fixup - to say anything useful we'll need to find out where we'd
really come from.

> [ 2642.369810] RIP: 0010:[<ffffffff817f87d4>]  [<ffffffff817f87d4>] bad_gs+0xd1d/0x1ba9
> [ 2642.370750] RSP: 0018:ffff8800d7277ba0  EFLAGS: 00010286
> [ 2642.371239] RAX: ffff8800c3a6ff30 RBX: ffff8800c3a6ff00 RCX: ffff880113f82000
> [ 2642.371765] RDX: ffff880113f82000 RSI: 0000000000000000 RDI: 0000000000000002
> [ 2642.372286] RBP: ffff8800d7277be8 R08: 0000000000000000 R09: ffff8800c3a6fed0
> [ 2642.372818] R10: 0000000000000059 R11: ffff8800d6956dd0 R12: ffff880111567ed0
> [ 2642.373415] R13: ffff8800d7277df0 R14: ffff8800c3a6ff50 R15: 0000000084832a57
> [ 2642.373940] FS:  00007fa1814a4700(0000) GS:ffff88011f480000(0000) knlGS:0000000000000000
> [ 2642.374877] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2642.375378] CR2: ffff880113f82000 CR3: 00000000d605c000 CR4: 00000000000006e0
> [ 2642.375881] Stack:
> [ 2642.376295]  ffffffff81243a55 0000000000000096 ffff880113f82000 ffffffff00000002
> [ 2642.377204]  000000000000bc04 ffff8800d7277df0 ffff880111567ed0 ffff8800d7277de0
> [ 2642.378113]  ffff8800d7277df0 ffff8800d7277c18 ffffffff81243c84 ffffffff81236c4e
> [ 2642.379022] Call Trace:
> [ 2642.379451]  [<ffffffff81243a55>] ? __d_lookup+0x5/0x1b0
> [ 2642.379920]  [<ffffffff81243c84>] d_lookup+0x84/0xb0
> [ 2642.380388]  [<ffffffff81236c4e>] ? lookup_open+0xfe/0x7a0
> [ 2642.380862]  [<ffffffff81236c4e>] lookup_open+0xfe/0x7a0
> [ 2642.381374]  [<ffffffff81237c3f>] path_openat+0x94f/0xfc0
> [ 2642.381852]  [<ffffffff8123935e>] do_filp_open+0x7e/0xe0
> [ 2642.382182]  [<ffffffff81233110>] ? lock_rename+0x100/0x100
> [ 2642.382747]  [<ffffffff817f4947>] ? _raw_spin_unlock+0x27/0x40
> [ 2642.383324]  [<ffffffff8124877c>] ? __alloc_fd+0xbc/0x170
> [ 2642.383864]  [<ffffffff81226896>] do_sys_open+0x116/0x1f0
> [ 2642.384230]  [<ffffffff8122698e>] SyS_open+0x1e/0x20
> [ 2642.384569]  [<ffffffff817f5136>] entry_SYSCALL_64_fastpath+0x1e/0xad
> [ 2642.398718] Code: e1 03 49 d3 e8 e9 61 ae a4 ff 48 8d 0a 48 83 e1 f8 4c 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 20 af a4 ff 48 8d 0a 48 83 e1 f8 <4c> 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 9b b3 a4 ff b9 f2 
> [ 2642.400892] RIP  [<ffffffff817f87d4>] bad_gs+0xd1d/0x1ba9
> [ 2642.401417]  RSP <ffff8800d7277ba0>
> [ 2642.401856] CR2: ffff880113f82000
> 
> Hm, somehow crashdumping support is broken for the newish kernels on my test box, I guess
> I'll try to fix it and then re-reproduce to better understand what's going on here,
> this trace is all I have for now in case anybody has any immediate ideas.

PS: Oleg, fix your MUA, please - long lines in mail are bloody annoying.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Oleg Drokin June 3, 2016, 6:35 p.m. UTC | #6

On Jun 3, 2016, at 2:22 PM, Al Viro wrote:

> On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote:
>> I am dropping NFS people since it seems to be converting into a generic VFS/dcache bug even though you need NFS or the like to trigger it - the lookup_open path.
> 
> NFS bug is real; there might very well be something else, but that d_drop()
> in nfs_atomic_open() needs to be restored.

Yes, that's what I mean - after the nfs fix, there seems to be another bug in
dcache that I can now trigger once nfs bug is no longer blocking the way.

>> [ 2642.364383] BUG: unable to handle kernel paging request at ffff880113f82000
>> [ 2642.365014] IP: [<ffffffff817f87d4>] bad_gs+0xd1d/0x1ba9
> 
> *ow*
> Could you dump your vmlinux (and System.map) somewhere on anonftp?
> This 'bad_gs' is there simply because it's one of the few labels in
> .fixup - to say anything useful we'll need to find out where we'd
> really come from.

I see.
vmlinux with debug symbols: http://knox.linuxhacker.ru/tmp/dcache/vmlinux.gz
System.map: http://knox.linuxhacker.ru/tmp/dcache/System.map.gz

>> Hm, somehow crashdumping support is broken for the newish kernels on my test box, I guess
>> I'll try to fix it and then re-reproduce to better understand what's going on here,
>> this trace is all I have for now in case anybody has any immediate ideas.
> 
> PS: Oleg, fix your MUA, please - long lines in mail are bloody annoying.

Huh, sorry about that. I kind of hoped the era of 80 columns text terminals was
mostly behind us. Time to dust off mutt, I guess.--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 8:07 p.m. UTC | #7

On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote:

> >> [ 2642.364383] BUG: unable to handle kernel paging request at ffff880113f82000
> >> [ 2642.365014] IP: [<ffffffff817f87d4>] bad_gs+0xd1d/0x1ba9
> > 
> > *ow*
> > Could you dump your vmlinux (and System.map) somewhere on anonftp?
> > This 'bad_gs' is there simply because it's one of the few labels in
> > .fixup - to say anything useful we'll need to find out where we'd
> > really come from.
> 
> I see.
> vmlinux with debug symbols: http://knox.linuxhacker.ru/tmp/dcache/vmlinux.gz
> System.map: http://knox.linuxhacker.ru/tmp/dcache/System.map.gz

OK...
ffffffff817f87cd:       48 8d 0a                lea    (%rdx),%rcx
ffffffff817f87d0:       48 83 e1 f8             and    $0xfffffffffffffff8,%rcx
ffffffff817f87d4:       4c 8b 01                mov    (%rcx),%r8
ffffffff817f87d7:       8d 0a                   lea    (%rdx),%ecx
ffffffff817f87d9:       83 e1 07                and    $0x7,%ecx
ffffffff817f87dc:       c1 e1 03                shl    $0x3,%ecx
ffffffff817f87df:       49 d3 e8                shr    %cl,%r8
ffffffff817f87e2:       e9 9b b3 a4 ff          jmpq   ffffffff81243b82 <__d_lookup+0x132>

Aha...  It's load_unaligned_zeropad() from dentry_string_cmp(), hitting
a genuinely unmapped address.  That sends it into fixup, where it tries to
load an aligned word containing the address in question, in hope that
fault was on attempt to cross into the next page.  No such luck, address
was aligned in the first place (it's in %rdx - 0xffff880113f82000), so
we still oops.

The unexpected part is that unmapped address did *NOT* come from a dentry;
it's .name of qstr we were looking for.  And your call chain was
__d_lookup() <- d_lookup() <- lookup_open(), so in lookup_open() it was
nd->last.name...

Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3,
4e8440b?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Oleg Drokin June 3, 2016, 9:17 p.m. UTC | #8

On Jun 3, 2016, at 4:07 PM, Al Viro wrote:

> On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote:
> 
>>>> [ 2642.364383] BUG: unable to handle kernel paging request at ffff880113f82000
>>>> [ 2642.365014] IP: [<ffffffff817f87d4>] bad_gs+0xd1d/0x1ba9
>>> 
>>> *ow*
>>> Could you dump your vmlinux (and System.map) somewhere on anonftp?
>>> This 'bad_gs' is there simply because it's one of the few labels in
>>> .fixup - to say anything useful we'll need to find out where we'd
>>> really come from.
>> 
>> I see.
>> vmlinux with debug symbols: http://knox.linuxhacker.ru/tmp/dcache/vmlinux.gz
>> System.map: http://knox.linuxhacker.ru/tmp/dcache/System.map.gz
> 
> OK...
> ffffffff817f87cd:       48 8d 0a                lea    (%rdx),%rcx
> ffffffff817f87d0:       48 83 e1 f8             and    $0xfffffffffffffff8,%rcx
> ffffffff817f87d4:       4c 8b 01                mov    (%rcx),%r8
> ffffffff817f87d7:       8d 0a                   lea    (%rdx),%ecx
> ffffffff817f87d9:       83 e1 07                and    $0x7,%ecx
> ffffffff817f87dc:       c1 e1 03                shl    $0x3,%ecx
> ffffffff817f87df:       49 d3 e8                shr    %cl,%r8
> ffffffff817f87e2:       e9 9b b3 a4 ff          jmpq   ffffffff81243b82 <__d_lookup+0x132>
> 
> Aha...  It's load_unaligned_zeropad() from dentry_string_cmp(), hitting
> a genuinely unmapped address.  That sends it into fixup, where it tries to
> load an aligned word containing the address in question, in hope that
> fault was on attempt to cross into the next page.  No such luck, address
> was aligned in the first place (it's in %rdx - 0xffff880113f82000), so
> we still oops.
> 
> The unexpected part is that unmapped address did *NOT* come from a dentry;
> it's .name of qstr we were looking for.  And your call chain was
> __d_lookup() <- d_lookup() <- lookup_open(), so in lookup_open() it was
> nd->last.name...
> 
> Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3,
> 4e8440b?

Well, that was faster than I expected. 4e8440b triggers right away, so I guess
there's no point in trying the later ones?
BTW, just to confirm you are noticing - this is a DEBUG_PAGEALLOC build,
so all freed memory is unmapped which is likely causing this oops - as a sign
of use after free.

[   54.990119] BUG: unable to handle kernel paging request at ffff8800d2b7f000
[   54.990423] IP: [<ffffffff817f91b6>] bad_gs+0xcff/0x1b89
[   54.990598] PGD 2dca067 PUD 11f900067 PMD 11f86a067 PTE 80000000d2b7f060
[   54.990942] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[   54.991044] Modules linked in: loop rpcsec_gss_krb5 joydev pcspkr i2c_piix4 acpi_cpufreq tpm_tis tpm nfsd drm_kms_helper ttm drm serio_raw virtio_blk
[   54.992301] CPU: 7 PID: 5550 Comm: file_concat.sh Not tainted 4.6.0-4e8440b+ #1
[   54.993019] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   54.993407] task: ffff8800d6ec6200 ti: ffff880110f44000 task.ti: ffff880110f44000
[   54.994112] RIP: 0010:[<ffffffff817f91b6>]  [<ffffffff817f91b6>] bad_gs+0xcff/0x1b89
[   54.994870] RSP: 0018:ffff880110f47ba0  EFLAGS: 00010286
[   54.995288] RAX: ffff8800d2760f30 RBX: ffff8800d2760f00 RCX: ffff8800d2b7f000
[   54.995736] RDX: ffff8800d2b7f000 RSI: 0000000000000000 RDI: 0000000000000001
[   54.996154] RBP: ffff880110f47be8 R08: 0000000000000000 R09: ffff8800d2760ed0
[   54.996628] R10: 0000000000000059 R11: ffff8800d6ec6d78 R12: ffff8800d2694ed0
[   54.997044] R13: ffff880110f47df0 R14: ffff8800d2760f50 R15: 00000000b761b37e
[   54.997450] FS:  00007fe8804d2700(0000) GS:ffff88011f5c0000(0000) knlGS:0000000000000000
[   54.998164] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   54.998624] CR2: ffff8800d2b7f000 CR3: 000000011931f000 CR4: 00000000000006e0
[   54.999225] Stack:
[   54.999596]  ffffffff81243965 0000000000000096 ffff8800d2b7f000 ffffffff00000001
[   55.000632]  000000000000042c ffff880110f47df0 ffff8800d2694ed0 ffff880110f47de0
[   55.001497]  ffff880110f47df0 ffff880110f47c18 ffffffff81243b94 ffffffff81234b3e
[   55.002353] Call Trace:
[   55.002709]  [<ffffffff81243965>] ? __d_lookup+0x5/0x1b0
[   55.003104]  [<ffffffff81243b94>] d_lookup+0x84/0xb0
[   55.003487]  [<ffffffff81234b3e>] ? lookup_open+0xfe/0x7a0
[   55.003878]  [<ffffffff81234b3e>] lookup_open+0xfe/0x7a0
[   55.004272]  [<ffffffff8123717e>] path_openat+0x4de/0xfc0
[   55.004665]  [<ffffffff8123925e>] do_filp_open+0x7e/0xe0
[   55.005061]  [<ffffffff812330a0>] ? lock_rename+0x100/0x100
[   55.005452]  [<ffffffff817f5367>] ? _raw_spin_unlock+0x27/0x40
[   55.005846]  [<ffffffff8124868c>] ? __alloc_fd+0xbc/0x170
[   55.006246]  [<ffffffff81226896>] do_sys_open+0x116/0x1f0
[   55.006635]  [<ffffffff8122698e>] SyS_open+0x1e/0x20
[   55.007036]  [<ffffffff817f5b36>] entry_SYSCALL_64_fastpath+0x1e/0xad
[   55.007434] Code: e1 03 49 d3 e8 e9 8f a3 a4 ff 48 8d 0a 48 83 e1 f8 4c 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 4e a4 a4 ff 48 8d 0a 48 83 e1 f8 <4c> 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 c9 a8 a4 ff b9 f2 
[   55.011146] RIP  [<ffffffff817f91b6>] bad_gs+0xcff/0x1b89
[   55.011584]  RSP <ffff880110f47ba0>
[   55.011943] CR2: ffff8800d2b7f000


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds June 3, 2016, 9:18 p.m. UTC | #9

On Fri, Jun 3, 2016 at 1:07 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Aha...  It's load_unaligned_zeropad() from dentry_string_cmp(), hitting
> a genuinely unmapped address.  That sends it into fixup, where it tries to
> load an aligned word containing the address in question, in hope that
> fault was on attempt to cross into the next page.  No such luck, address
> was aligned in the first place (it's in %rdx - 0xffff880113f82000), so
> we still oops.

Hmm. We do end up comparing the string length racily with the name
address, so I suspect there could be a real bug in dentry_cmp: the
fact that the name pointer and length aren't necessarily atomic wrt
each other means that we could overrun the dentry pointer to the next
page, even though it's aligned.

So maybe we need to do a careful load of the aligned dentry string
value too, not just the possibly unaligned qstr name. It's a *very*
unlikely race to hit, though. You'd have to shrink the name in the
dentry, get the old (longer name length) to match the one you look up,
and when have the memory unmapped at the end of the new (short)
length.

However, this is not that theoretical race case for two reasons:

 (1) this happens in __d_lookup(), which has taken the dentry lock. So
it's not the racy unlocked RCU case to begin with.

 (2) as you point out, since it is the load_unaligned_zeropad() case,
it isn't the possibly racy dentry name at all, but trhe qstr we're
comparing against (which may have the unaligned case, but not the
confusion about length).

> The unexpected part is that unmapped address did *NOT* come from a dentry;
> it's .name of qstr we were looking for.  And your call chain was
> __d_lookup() <- d_lookup() <- lookup_open(), so in lookup_open() it was
> nd->last.name...

That should have no such issues. It should be a normal qstr, and the
length should be reliable.

So something must have corrupted the qstr.

The remaining length *should* in %edi, judging by the

   0xffffffff81243b82 <+306>: cmp    $0x7,%edi

in the __d_lookup() disassembly. And %rdi contains 2, so there were
supposed to be two more characters at 'ct' (which is %rdx).

Why would nd->last.name be bogus? I don't see anything.

           Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 9:26 p.m. UTC | #10

On Fri, Jun 03, 2016 at 02:18:15PM -0700, Linus Torvalds wrote:

> So something must have corrupted the qstr.
> 
> The remaining length *should* in %edi, judging by the
> 
>    0xffffffff81243b82 <+306>: cmp    $0x7,%edi
> 
> in the __d_lookup() disassembly. And %rdi contains 2, so there were
> supposed to be two more characters at 'ct' (which is %rdx).

... and since r8 and rsi are 0, we couldn't have consumed anything.
> 
> Why would nd->last.name be bogus? I don't see anything.

An interesting part is that it's page-aligned.  Which is impossible for
a short name obtained by getname(), but is quite likely for a symlink body.
So at a guess, we have a page containing a symlink body freed under us.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 9:46 p.m. UTC | #11

On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote:

> > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3,
> > 4e8440b?
> 
> Well, that was faster than I expected. 4e8440b triggers right away, so I guess
> there's no point in trying the later ones?
> BTW, just to confirm you are noticing - this is a DEBUG_PAGEALLOC build,
> so all freed memory is unmapped which is likely causing this oops - as a sign
> of use after free.
 
> [   54.990119] BUG: unable to handle kernel paging request at ffff8800d2b7f000

Again a page-aligned nd->last.name and even smaller nd->last.len.  It smells
like a page that used to contain a symlink body, but got freed under us.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds June 3, 2016, 10 p.m. UTC | #12

On Fri, Jun 3, 2016 at 2:26 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> in the __d_lookup() disassembly. And %rdi contains 2, so there were
>> supposed to be two more characters at 'ct' (which is %rdx).
>
> ... and since r8 and rsi are 0, we couldn't have consumed anything.

Right you are. So it really started out page-aligned.

>> Why would nd->last.name be bogus? I don't see anything.
>
> An interesting part is that it's page-aligned.  Which is impossible for
> a short name obtained by getname(), but is quite likely for a symlink body.
> So at a guess, we have a page containing a symlink body freed under us.

Hmm. Good point.

Is perhaps the "delayed_call" logic broken, and the symlink is free'd too early?

That whole set_delayed_call/do_delayed_call thing came in 4.5. Maybe
something broke that logic, and we've executed the delayed freeing
before we should have.

Normally it's done at terminate_walk() time. But I note that in
walk_component(), we do put_link(nd) which does a do_delayed_call(),
but does *not* do a clear_delayed_call(), so now I think a subsequent
terminate_walk() might drop it *again*.

I'm probably missing something, but I have to say that the old
explicit "put_link()" callback logic was more obvious than the new
delayed calls are.

          Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 10:17 p.m. UTC | #13

On Fri, Jun 03, 2016 at 10:46:31PM +0100, Al Viro wrote:
> On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote:
> 
> > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3,
> > > 4e8440b?
> > 
> > Well, that was faster than I expected. 4e8440b triggers right away, so I guess
> > there's no point in trying the later ones?
> > BTW, just to confirm you are noticing - this is a DEBUG_PAGEALLOC build,
> > so all freed memory is unmapped which is likely causing this oops - as a sign
> > of use after free.
>  
> > [   54.990119] BUG: unable to handle kernel paging request at ffff8800d2b7f000
> 
> Again a page-aligned nd->last.name and even smaller nd->last.len.  It smells
> like a page that used to contain a symlink body, but got freed under us.

OK, I think I understand what's going on there.  We have a pathname that ends
with a trailing symlink.  Traverse that symlink up to the last component.  And
get EOPENSTALE on attempt to open that.  At that point we proceed to
retry_lookup: and call lookup_open().  But we'd *already* done put_link()
on the first pass, so now nd->last.name points into freed page.

Damn...  I'm very tempted to rip the retry_lookup logics out of there and
just let the damn thing repeat the whole pathname resolution ;-/  do_last()
will become so much saner after that...

Let's at least verify that this is what's going on - remove
                if (error == -EOPENSTALE)
                        goto stale_open;
from do_last() and see if that fixes the damn thing.  Alternative solution
would be to turn that
        if (nd->depth)
                put_link(nd);
        error = should_follow_link(nd, &path, nd->flags & LOOKUP_FOLLOW,
                                   inode, seq);
        if (unlikely(error))
                return error;
in do_last() into
        error = should_follow_link(nd, &path, nd->flags & LOOKUP_FOLLOW,
                                   inode, seq);
	if (unlikely(error)) {
		if (nd->depth == 2) {
			struct saved *last = nd->stack[0];
			do_delayed_call(&last->done);
			if (!(nd->flags & LOOKUP_RCU))
				path_put(&last->link);
			nd->stack[0] = nd->stack[1];
			nd->depth--;
		}
		return error;
	}	
but I would really prefer the first approach - it allows to remove arseloads
of convoluted crap from do_last().
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 10:23 p.m. UTC | #14

On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote:

> Is perhaps the "delayed_call" logic broken, and the symlink is free'd too early?
> 
> That whole set_delayed_call/do_delayed_call thing came in 4.5. Maybe
> something broke that logic, and we've executed the delayed freeing
> before we should have.
> 
> Normally it's done at terminate_walk() time. But I note that in
> walk_component(), we do put_link(nd) which does a do_delayed_call(),
> but does *not* do a clear_delayed_call(), so now I think a subsequent
> terminate_walk() might drop it *again*.

Nope - put_link() also decrements nd->depth.  No double calls there...

> I'm probably missing something, but I have to say that the old
> explicit "put_link()" callback logic was more obvious than the new
> delayed calls are.

It's not that.  It's explicit put_link() in do_last(), followed by
ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN"
looking at now-freed nd->last.name.  IOW, the bug predates delayed_call
stuff.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 10:29 p.m. UTC | #15

On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote:

> It's not that.  It's explicit put_link() in do_last(), followed by
> ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN"
> looking at now-freed nd->last.name.  IOW, the bug predates delayed_call
> stuff.

FWIW, I'd stepped into that in d63ff28f "namei: make should_follow_link() store
the link in nd->link", so it's 4.1+ mess.  delayed_call stuff is 4.4+...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Linus Torvalds June 3, 2016, 10:36 p.m. UTC | #16

On Fri, Jun 3, 2016 at 3:23 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote:
>>>
>> Normally it's done at terminate_walk() time. But I note that in
>> walk_component(), we do put_link(nd) which does a do_delayed_call(),
>> but does *not* do a clear_delayed_call(), so now I think a subsequent
>> terminate_walk() might drop it *again*.
>
> Nope - put_link() also decrements nd->depth.  No double calls there...

Yeah, I figured that out, and then continued to try to look at other cases..

Happy to hear that you seem to have figured it out.

But why did it apparently only start happening now?

             Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Oleg Drokin June 3, 2016, 10:42 p.m. UTC | #17

On Jun 3, 2016, at 6:36 PM, Linus Torvalds wrote:

> On Fri, Jun 3, 2016 at 3:23 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote:
>>>> 
>>> Normally it's done at terminate_walk() time. But I note that in
>>> walk_component(), we do put_link(nd) which does a do_delayed_call(),
>>> but does *not* do a clear_delayed_call(), so now I think a subsequent
>>> terminate_walk() might drop it *again*.
>> 
>> Nope - put_link() also decrements nd->depth.  No double calls there...
> 
> Yeah, I figured that out, and then continued to try to look at other cases..
> 
> Happy to hear that you seem to have figured it out.
> 
> But why did it apparently only start happening now?

Apparently nobody runs NFS in a serious manner these days anymore.
EOPENSTALE is only used by NFS.


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro June 3, 2016, 10:43 p.m. UTC | #18

On Fri, Jun 03, 2016 at 03:36:22PM -0700, Linus Torvalds wrote:

> Happy to hear that you seem to have figured it out.
> 
> But why did it apparently only start happening now?

Oleg has started to use Lustre torture tests on NFS, that's all.  Note, BTW,
that first they'd triggered an oopsable bug (fairly easy to reproduce) in
nfs_atomic_open() that had been there for 3 years ;-/
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

NFS/d_splice_alias breakage

Commit Message

Comments

Patch