More parallel atomic_open/d_splice_alias fun with NFS and possibly more FSes.

Message ID 20160704030812.GI14480@ZenIV.linux.org.uk (mailing list archive)
State New, archived

Commit Message

Al Viro July 4, 2016, 3:08 a.m. UTC
On Sun, Jul 03, 2016 at 08:37:22PM -0400, Oleg Drokin wrote:

> Hm… This dates to sometime in 2006 and my memory is a bit hazy here.
> 
> I think when we called into the open, it went into fifo open and stuck there
> waiting for the other opener. Something like that. And we cannot really be stuck here
> because we are holding some locks that need to be released in predictable time.
> 
> This code is actually unreachable now because the server never returns an openhandle
> for special device nodes anymore (there's a comment about it in current staging tree,
> but I guess you are looking at some prior version).
> 
> I imagine device nodes might have represented a similar risk too, but it did not
> occur to me to test it separately and the testsuite does not do it either.
> 
> Directories do not get stuck when you open them so they are ok and we can
> atomically open them too, I guess.
> Symlinks are handled specially on the server and the open never returns
> the actual open handle for those, so this path is also unreachable with those.

Hmm...  How much does the safety of client depend upon the correctness of
server?

BTW, there's a fun issue in ll_revalidate_dentry(): there's nothing to
promise stability of ->d_parent in there, so uses of d_inode(dentry->d_parent)
are not safe.  That's independent from parallel lookups, and it's hard
to hit, but AFAICS it's not impossible to oops there.

Anyway, for Lustre the analogue of that NFS problem is here:
        } else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
                   !it_disposition(it, DISP_OPEN_CREATE)) {
                /* With DISP_OPEN_CREATE dentry will be
                 * instantiated in ll_create_it.
                 */
                LASSERT(!d_inode(*de));
                d_instantiate(*de, inode);
        }
AFAICS, this (on top of mainline) ought to work:

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Oleg Drokin July 4, 2016, 3:55 a.m. UTC | #1
On Jul 3, 2016, at 11:08 PM, Al Viro wrote:

> On Sun, Jul 03, 2016 at 08:37:22PM -0400, Oleg Drokin wrote:
> 
>> Hm… This dates to sometime in 2006 and my memory is a bit hazy here.
>> 
>> I think when we called into the open, it went into fifo open and stuck there
>> waiting for the other opener. Something like that. And we cannot really be stuck here
>> because we are holding some locks that need to be released in predictable time.
>> 
>> This code is actually unreachable now because the server never returns an openhandle
>> for special device nodes anymore (there's a comment about it in current staging tree,
>> but I guess you are looking at some prior version).
>> 
>> I imagine device nodes might have represented a similar risk too, but it did not
>> occur to me to test it separately and the testsuite does not do it either.
>> 
>> Directories do not get stuck when you open them so they are ok and we can
>> atomically open them too, I guess.
>> Symlinks are handled specially on the server and the open never returns
>> the actual open handle for those, so this path is also unreachable with those.
> 
> Hmm...  How much does the safety of client depend upon the correctness of
> server?

Quite a bit, actually. If you connect to a rogue Lustre server,
currently there are many ways it can crash the client.
I suspect this is true not just of Lustre, if e.g. NFS server starts to
send directory inodes with duplicated inode numbers or some such,
VFS would not be super happy about such "hardlinked" directories either.
This is before we even consider that it can feed you garbage data
to crash your apps (or substitute binaries to do something else).

> BTW, there's a fun issue in ll_revalidate_dentry(): there's nothing to
> promise stability of ->d_parent in there, so uses of d_inode(dentry->d_parent)

Yes, we actually had a discussion about that in March, we were not the only ones
affected, and I think it was decided that dget_parent() was a better solution
to get to the parent (I see ext4 has already converted).
I believe you cannot hit it in Lustre now due to Lustre locking magic, but
I'll create a patch to cover this anyway. Thanks for reminding me about this.

> are not safe.  That's independent from parallel lookups, and it's hard
> to hit, but AFAICS it's not impossible to oops there.
> 
> Anyway, for Lustre the analogue of that NFS problem is here:
>        } else if (!it_disposition(it, DISP_LOOKUP_NEG)  &&
>                   !it_disposition(it, DISP_OPEN_CREATE)) {
>                /* With DISP_OPEN_CREATE dentry will be
>                 * instantiated in ll_create_it.
>                 */
>                LASSERT(!d_inode(*de));
>                d_instantiate(*de, inode);
>        }

Hm… Do you mean that when we do come hashed here, with a negative dentry
and positive disposition and hit the assertion about inode not being NULL
(still staying negative, basically)?
This one we cannot hit because negative dentries are protected by a Lustre
dlm lock held by the parent directory. Any create in that parent directory
would invalidate the lock and once that happens, all negative dentries would
be killed.
Hmm… This probably means this is dead code?
Ah, I guess it's not.
If we do a lookup and find this negative dentry (from 2+ threads) and THEN it gets
invalidated and our two threads both race to instantiate it...
It does look like something that is quite hard to hit, but it still looks like a race
that could happen.
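
A userspace analogy of that check-then-instantiate window, as a rough sketch only: `model_dentry`, `model_inode` and `model_instantiate_once()` are hypothetical names, not kernel API, and this is not what the patch in this thread does (the patch serialises the lookups with d_alloc_parallel()). It only illustrates why "assert the dentry is negative, then attach the inode" is unsafe when two callers can get there at once, and how a compare-and-swap lets exactly one of them win.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; only identity matters here. */
struct model_inode {
	unsigned long ino;
};

/* Hypothetical stand-in for a dentry: NULL d_inode means "negative". */
struct model_dentry {
	_Atomic(struct model_inode *) d_inode;
};

/* Start out negative, like a freshly looked-up dentry with no inode. */
void model_dentry_init(struct model_dentry *d)
{
	atomic_init(&d->d_inode, (struct model_inode *)NULL);
}

/*
 * Attach inode only if the dentry is still negative.  The check and the
 * store are one atomic step, so of two racing callers exactly one gets
 * true; the other sees that the dentry already went positive.
 */
bool model_instantiate_once(struct model_dentry *d, struct model_inode *inode)
{
	struct model_inode *expected = NULL;

	return atomic_compare_exchange_strong(&d->d_inode, &expected, inode);
}

/* Read back whatever inode is attached (NULL while negative). */
struct model_inode *model_d_inode(struct model_dentry *d)
{
	return atomic_load(&d->d_inode);
}
```

With a separate "check for NULL, then store" sequence, both callers could pass the check before either stores; in this sketch the second caller gets false back instead of tripping an assertion.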

> AFAICS, this (on top of mainline) ought to work:

Thanks, I'll give this a try.
> 
> diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
> index 5eba0eb..b8da5b4 100644
> --- a/drivers/staging/lustre/lustre/llite/namei.c
> +++ b/drivers/staging/lustre/lustre/llite/namei.c
> @@ -581,9 +581,11 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
> 			  struct file *file, unsigned open_flags,
> 			  umode_t mode, int *opened)
> {
> +	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
> 	struct lookup_intent *it;
> 	struct dentry *de;
> 	long long lookup_flags = LOOKUP_OPEN;
> +	bool switched = false;
> 	int rc = 0;
> 
> 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir="DFID"(%p),file %p,open_flags %x,mode %x opened %d\n",
> @@ -603,11 +605,28 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
> 	it->it_flags = (open_flags & ~O_ACCMODE) | OPEN_FMODE(open_flags);
> 
> 	/* Dentry added to dcache tree in ll_lookup_it */
> +	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
> +		d_drop(dentry);
> +		switched = true;
> +		dentry = d_alloc_parallel(dentry->d_parent,
> +					  &dentry->d_name, &wq);
> +		if (IS_ERR(dentry)) {
> +			rc = PTR_ERR(dentry);
> +			goto out_release;
> +		}
> +		if (unlikely(!d_in_lookup(dentry))) {
> +			rc = finish_no_open(file, dentry);
> +			goto out_release;
> +		}
> +	}
> +
> 	de = ll_lookup_it(dir, dentry, it, lookup_flags);
> 	if (IS_ERR(de))
> 		rc = PTR_ERR(de);
> 	else if (de)
> 		dentry = de;
> +	else if (switched)
> +		de = dget(dentry);
> 
> 	if (!rc) {
> 		if (it_disposition(it, DISP_OPEN_CREATE)) {
> @@ -648,6 +667,10 @@ static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
> 	}
> 
> out_release:
> +	if (unlikely(switched)) {
> +		d_lookup_done(dentry);
> +		dput(dentry);
> +	}
> 	ll_intent_release(it);
> 	kfree(it);
> 

Al Viro July 5, 2016, 2:25 a.m. UTC | #2
On Sun, Jul 03, 2016 at 11:55:09PM -0400, Oleg Drokin wrote:
> Quite a bit, actually. If you connect to a rogue Lustre server,
> currently there are many ways it can crash the client.
> I suspect this is true not just of Lustre, if e.g. NFS server starts to
> send directory inodes with duplicated inode numbers or some such,
> VFS would not be super happy about such "hardlinked" directories either.
> This is before we even consider that it can feed you garbage data
> to crash your apps (or substitute binaries to do something else).

NFS client is at least supposed to try to be resistant to that.  As in,
"if an 0wn3d NFS server can be escalated to buggered client, it's a bug in
client and we are expected to try and fix it".

[snip]
> Thanks, I'll give this a try.

BTW, could you take a look at
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git#sendmsg.lustre?
It's a bunch of simplifications that became possible once sendmsg()/recvmsg()
switched to iov_iter, stopped mangling the iovecs and went for predictable
behaviour re advancing the iterator.
Oleg Drokin July 10, 2016, 5:01 p.m. UTC | #3
On Jul 4, 2016, at 10:25 PM, Al Viro wrote:

> BTW, could you take a look at
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git#sendmsg.lustre?
> It's a bunch of simplifications that became possible once sendmsg()/recvmsg()
> switched to iov_iter, stopped mangling the iovecs and went for predictable
> behaviour re advancing the iterator.

Thanks, this looks good to me and passes my testing (on tcp).

+typedef struct bio_vec lnet_kiov_t;

I guess this means we'll need to get rid of all lnet_kiov_t usage, but that's
something we can do ourselves.

Anyway, your patchset is based on an old tree and no longer applies cleanly, so
I rebased it onto the current staging tree to save you time in case
you want to go forward with it.
It's at git@github.com:verygreen/linux.git branch lustre-next-sendmsg

James, can you please give it a try on IB?

Bye,
    Oleg
James Simmons July 10, 2016, 6:14 p.m. UTC | #4
> On Jul 4, 2016, at 10:25 PM, Al Viro wrote:
> 
> > BTW, could you take a look at
> > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git#sendmsg.lustre?
> > It's a bunch of simplifications that became possible once sendmsg()/recvmsg()
> > switched to iov_iter, stopped mangling the iovecs and went for predictable
> > behaviour re advancing the iterator.
> 
> Thanks, this looks good to me and passes my testing (on tcp).
> 
> +typedef struct bio_vec lnet_kiov_t;
> 
> I guess this means we'll need to get rid of all lnet_kiov_t usage, but that's
> something we can do ourselves.
> 
> Anyway, your patchset is based on an old tree and no longer applies cleanly, so
> I rebased it onto the current staging tree to save you time in case
> you want to go forward with it.
> It's at git@github.com:verygreen/linux.git branch lustre-next-sendmsg
> 
> James, can you please give it a try on IB?

It's broken for the ko2iblnd driver.

[  110.840583] LNet: Using FMR for registration
[  110.991747] LNet: Added LNI 10.37.248.137@o2ib1 [63/2560/0/180]
[  110.998211] ------------[ cut here ]------------
[  111.003012] kernel BUG at lib/iov_iter.c:513!
[  111.007545] invalid opcode: 0000 [#1] SMP
[  111.011731] Modules linked in: ko2iblnd(C) ptlrpc(C+) obdclass(C) ksocklnd(C) lnet(C) sha512_generic sha256_generic md5 crc32_generic crc32_pclmul libcfs(C) autofs4 ipmi_devintf auth_rpcgss nfsv4 dns_resolver 8021q iptable_filter ip_tables x_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm configfs ib_cm iw_cm mlx4_ib ib_core dm_mirror dm_region_hash dm_log dm_multipath sg sd_mod joydev pcspkr dm_mod mpt3sas raid_class acpi_cpufreq ipmi_ssif ipmi_si ipmi_msghandler isci libsas scsi_transport_sas wmi tpm_tis tpm i2c_i801 ahci libahci libata scsi_mod ehci_pci ehci_hcd button tcp_cubic nfsv3(E) nfs_acl(E) ipv6(E) nfs(E) lockd(E) sunrpc(E) grace(E) mlx4_en(E) mlx4_core(E) igb(E) i2c_algo_bit(E) i2c_core(E) ptp(E) pps_core(E) hwmon(E)
[  111.086669] CPU: 6 PID: 11899 Comm: router_checker Tainted: G         C  E   4.7.0-rc6+ #1
[  111.095248] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0a 02/19/2014
[  111.102040] task: ffff880826d32d80 ti: ffff880811d24000 task.ti: ffff880811d24000
[  111.109818] RIP: 0010:[<ffffffff8128daf2>]  [<ffffffff8128daf2>] iov_iter_kvec+0x22/0x30
[  111.118302] RSP: 0018:ffff880811d27b28  EFLAGS: 00010246
[  111.123806] RAX: 0000000000000000 RBX: ffff88105e037c00 RCX: 0000000000000000
[  111.131111] RDX: 0000000000000000 RSI: 0000000000000005 RDI: ffff880811d27b78
[  111.138426] RBP: ffff880811d27b28 R08: 0000000000000000 R09: 0000000000000000
[  111.145751] R10: 0000000000000000 R11: 00000000fffd19f7 R12: 0000000000000000
[  111.153083] R13: 0000000000000000 R14: 000500010a25ca3b R15: ffff880811d27b78
[  111.160407] FS:  0000000000000000(0000) GS:ffff88107fd00000(0000) knlGS:0000000000000000
[  111.168797] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  111.174717] CR2: 00007fd075c48945 CR3: 000000105a9ec000 CR4: 00000000000406e0
[  111.182031] Stack:
[  111.184224]  ffff880811d27be8 ffffffffa04012bd 0000000000000000 ffff880811d27b40
[  111.192206]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  111.192206]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  111.200196]  0000880800000002 ffff88105e394000 0000000000000000 0000000000000000
[  111.208179] Call Trace:
[  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]
[  111.217370]  [<ffffffffa06ec6bd>] lnet_ni_send+0x3d/0xe0 [lnet]
[  111.223487]  [<ffffffffa06ee223>] lnet_send+0x6b3/0xc80 [lnet]
[  111.229501]  [<ffffffffa06eeb58>] LNetGet+0x368/0x650 [lnet]
[  111.235346]  [<ffffffffa0692a50>] ? cfs_percpt_lock+0x50/0x110 [libcfs]
[  111.242139]  [<ffffffffa06f4d8f>] lnet_ping_router_locked+0x20f/0x840 [lnet]
[  111.249384]  [<ffffffffa06f5909>] lnet_router_checker+0xd9/0x490 [lnet]
[  111.256192]  [<ffffffff8108347d>] ? default_wake_function+0xd/0x10
[  111.262549]  [<ffffffff810923f1>] ? __wake_up_common+0x51/0x80
[  111.268562]  [<ffffffffa06f5830>] ? lnet_prune_rc_data+0x470/0x470 [lnet]
[  111.275544]  [<ffffffff81508f9b>] ? schedule+0x3b/0xa0
[  111.280871]  [<ffffffffa06f5830>] ? lnet_prune_rc_data+0x470/0x470 [lnet]
[  111.287849]  [<ffffffff810779d7>] kthread+0xc7/0xe0
[  111.292904]  [<ffffffff8150c3cf>] ret_from_fork+0x1f/0x40
[  111.298475]  [<ffffffff81077910>] ? kthread_freezable_should_stop+0x70/0x70
[  111.305631] Code: 2e 0f 1f 84 00 00 00 00 00 55 40 f6 c6 02 48 89 e5 74 18 89 37 48 89 57 18 48 89 4f 20 48 c7 47 08 00 00 00 00 4c 89 47 10 c9 c3 <0f> 0b eb fe 66 2e 0f 1f 84 00 00 00 00 00 48 8b 47 10 55 48 89
[  111.329528] RIP  [<ffffffff8128daf2>] iov_iter_kvec+0x22/0x30
[  111.335533]  RSP <ffff880811d27b28>
[  111.339360] ---[ end trace 1ea9288f558e2c8d ]---
Al Viro July 11, 2016, 1:01 a.m. UTC | #5
On Sun, Jul 10, 2016 at 07:14:18PM +0100, James Simmons wrote:

> [  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]

Mea culpa - in kiblnd_send() this
        if (payload_kiov)
                iov_iter_bvec(&from, ITER_BVEC | WRITE,
                                payload_kiov, payload_niov, payload_nob);
        else
                iov_iter_kvec(&from, ITER_BVEC | WRITE,
                                payload_iov, payload_niov, payload_nob);
should have s/BVEC/KVEC/ in the iov_iter_kvec() arguments.  Cut'n'paste
braindamage...
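
The mix-up can be modelled in userspace. The sketch below is not kernel code: `model_iter`, `model_iter_kvec()`, `model_iter_bvec()` and the `MODEL_*` bits are hypothetical stand-ins, and the bit values are an assumption (WRITE = 1, ITER_KVEC = 2, ITER_BVEC = 4), chosen because they would match RSI = 5 in the oops above. Where the kernel helper hits BUG_ON() on a missing type bit, the model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; values assumed to mirror WRITE/ITER_KVEC/ITER_BVEC. */
enum {
	MODEL_WRITE = 1,
	MODEL_KVEC  = 2,
	MODEL_BVEC  = 4,
};

/* Minimal iterator state: type/direction bits plus segment bookkeeping. */
struct model_iter {
	int type;
	size_t nr_segs;
	size_t count;
};

/*
 * Model of a kvec initialiser: it insists that the caller passed the
 * KVEC bit.  The kernel helper BUG()s on this check; here we return
 * false and leave the iterator untouched.
 */
bool model_iter_kvec(struct model_iter *i, int direction,
		     size_t nr_segs, size_t count)
{
	if (!(direction & MODEL_KVEC))
		return false;
	i->type = direction;
	i->nr_segs = nr_segs;
	i->count = count;
	return true;
}

/* Same idea for the bvec flavour: the BVEC bit must be set. */
bool model_iter_bvec(struct model_iter *i, int direction,
		     size_t nr_segs, size_t count)
{
	if (!(direction & MODEL_BVEC))
		return false;
	i->type = direction;
	i->nr_segs = nr_segs;
	i->count = count;
	return true;
}
```

Calling `model_iter_kvec()` with `MODEL_BVEC | MODEL_WRITE` (value 5) is rejected, while `MODEL_KVEC | MODEL_WRITE` is accepted, which is the s/BVEC/KVEC/ change in miniature.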
Al Viro July 11, 2016, 1:03 a.m. UTC | #6
On Mon, Jul 11, 2016 at 02:01:13AM +0100, Al Viro wrote:
> On Sun, Jul 10, 2016 at 07:14:18PM +0100, James Simmons wrote:
> 
> > [  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]
> 
> Mea culpa - in kiblnd_send() this
>         if (payload_kiov)
>                 iov_iter_bvec(&from, ITER_BVEC | WRITE,
>                                 payload_kiov, payload_niov, payload_nob);
>         else
>                 iov_iter_kvec(&from, ITER_BVEC | WRITE,
>                                 payload_iov, payload_niov, payload_nob);
> should have s/BVEC/KVEC/ in the iov_iter_kvec() arguments.  Cut'n'paste
> braindamage...

PS: That was introduced in the last commit in that pile - "lustre: introduce
lnet_copy_{k,}iov2iter(), kill lnet_copy_{k,}iov2{k,}iov()".
James Simmons July 11, 2016, 5:15 p.m. UTC | #7
> On Sun, Jul 10, 2016 at 07:14:18PM +0100, James Simmons wrote:
> 
> > [  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]
> 
> Mea culpa - in kiblnd_send() this
>         if (payload_kiov)
>                 iov_iter_bvec(&from, ITER_BVEC | WRITE,
>                                 payload_kiov, payload_niov, payload_nob);
>         else
>                 iov_iter_kvec(&from, ITER_BVEC | WRITE,
>                                 payload_iov, payload_niov, payload_nob);
> should have s/BVEC/KVEC/ in the iov_iter_kvec() arguments.  Cut'n'paste
> braindamage...

That is the fix. Also, I believe payload_nob should be payload_nob +
payload_offset instead. I will send a patch against Oleg's tree
that addresses these issues.
Oleg Drokin July 11, 2016, 10:54 p.m. UTC | #8
On Jul 10, 2016, at 9:03 PM, Al Viro wrote:

> On Mon, Jul 11, 2016 at 02:01:13AM +0100, Al Viro wrote:
>> On Sun, Jul 10, 2016 at 07:14:18PM +0100, James Simmons wrote:
>> 
>>> [  111.210818]  [<ffffffffa04012bd>] kiblnd_send+0x51d/0x9e0 [ko2iblnd]
>> 
>> Mea culpa - in kiblnd_send() this
>>        if (payload_kiov)
>>                iov_iter_bvec(&from, ITER_BVEC | WRITE,
>>                                payload_kiov, payload_niov, payload_nob);
>>        else
>>                iov_iter_kvec(&from, ITER_BVEC | WRITE,
>>                                payload_iov, payload_niov, payload_nob);
>> should have s/BVEC/KVEC/ in the iov_iter_kvec() arguments.  Cut'n'paste
>> braindamage...
> 
> PS: That was introduced in the last commit in that pile - "lustre: introduce
> lnet_copy_{k,}iov2iter(), kill lnet_copy_{k,}iov2{k,}iov()".

Is this something you plan to submit to Linus or should I just submit this to
Greg along with other changes?

Patch

diff --git a/drivers/staging/lustre/lustre/llite/namei.c b/drivers/staging/lustre/lustre/llite/namei.c
index 5eba0eb..b8da5b4 100644
--- a/drivers/staging/lustre/lustre/llite/namei.c
+++ b/drivers/staging/lustre/lustre/llite/namei.c
@@ -581,9 +581,11 @@  static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 			  struct file *file, unsigned open_flags,
 			  umode_t mode, int *opened)
 {
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct lookup_intent *it;
 	struct dentry *de;
 	long long lookup_flags = LOOKUP_OPEN;
+	bool switched = false;
 	int rc = 0;
 
 	CDEBUG(D_VFSTRACE, "VFS Op:name=%pd, dir="DFID"(%p),file %p,open_flags %x,mode %x opened %d\n",
@@ -603,11 +605,28 @@  static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	it->it_flags = (open_flags & ~O_ACCMODE) | OPEN_FMODE(open_flags);
 
 	/* Dentry added to dcache tree in ll_lookup_it */
+	if (!(open_flags & O_CREAT) && !d_unhashed(dentry)) {
+		d_drop(dentry);
+		switched = true;
+		dentry = d_alloc_parallel(dentry->d_parent,
+					  &dentry->d_name, &wq);
+		if (IS_ERR(dentry)) {
+			rc = PTR_ERR(dentry);
+			goto out_release;
+		}
+		if (unlikely(!d_in_lookup(dentry))) {
+			rc = finish_no_open(file, dentry);
+			goto out_release;
+		}
+	}
+
 	de = ll_lookup_it(dir, dentry, it, lookup_flags);
 	if (IS_ERR(de))
 		rc = PTR_ERR(de);
 	else if (de)
 		dentry = de;
+	else if (switched)
+		de = dget(dentry);
 
 	if (!rc) {
 		if (it_disposition(it, DISP_OPEN_CREATE)) {
@@ -648,6 +667,10 @@  static int ll_atomic_open(struct inode *dir, struct dentry *dentry,
 	}
 
 out_release:
+	if (unlikely(switched)) {
+		d_lookup_done(dentry);
+		dput(dentry);
+	}
 	ll_intent_release(it);
 	kfree(it);