diff mbox

[1/2] nfs: Fix spurios EPERM when mkdir of existing dentry

Message ID 1467870827-2959489-2-git-send-email-green@linuxhacker.ru
State New
Headers show

Commit Message

Oleg Drokin July 7, 2016, 5:53 a.m. UTC
It's great when we can shave an extra RPC, but not at the expense
of correctness.
We should not return EPERM (from vfs_create/mknod/mkdir) if the
name already exists, even if we have no write access in parent.

Since the check in nfs_permission is clearly not enough to stave
off this, just throw in the extra READ access to actually
go through.

Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
---
 fs/nfs/dir.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Trond Myklebust July 7, 2016, 4:16 p.m. UTC | #1
> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote:
> 
> It's great when we can shave an extra RPC, but not at the expense
> of correctness.
> We should not return EPERM (from vfs_create/mknod/mkdir) if the
> name already exists, even if we have no write access in parent.
> 
> Since the check in nfs_permission is clearly not enough to stave
> off this, just throw in the extra READ access to actually
> go through.
> 
> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
> ---
> fs/nfs/dir.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index d8015a0..8c7835b 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
> 	/*
> 	 * If we're doing an exclusive create, optimize away the lookup
> 	 * but don't hash the dentry.
> +	 * This optimization only works if we can write in the parent.
> 	 */
> -	if (nfs_is_exclusive_create(dir, flags))
> +	if (nfs_is_exclusive_create(dir, flags) &&
> +	    (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0))
> 		return NULL;
> 

NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server.

Cheers
  Trond

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oleg Drokin July 7, 2016, 4:52 p.m. UTC | #2
On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote:

> 
>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote:
>> 
>> It's great when we can shave an extra RPC, but not at the expense
>> of correctness.
>> We should not return EPERM (from vfs_create/mknod/mkdir) if the
>> name already exists, even if we have no write access in parent.
>> 
>> Since the check in nfs_permission is clearly not enough to stave
>> off this, just throw in the extra READ access to actually
>> go through.
>> 
>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>> ---
>> fs/nfs/dir.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>> index d8015a0..8c7835b 100644
>> --- a/fs/nfs/dir.c
>> +++ b/fs/nfs/dir.c
>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
>> 	/*
>> 	 * If we're doing an exclusive create, optimize away the lookup
>> 	 * but don't hash the dentry.
>> +	 * This optimization only works if we can write in the parent.
>> 	 */
>> -	if (nfs_is_exclusive_create(dir, flags))
>> +	if (nfs_is_exclusive_create(dir, flags) &&
>> +	    (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0))
>> 		return NULL;
>> 
> 
> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server.

Right. This was mostly a discussion piece.
The problem here is nfs_permission() returns 0 if you check for
inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then
some other checks in the kernel still catch on to the fact that the directory
is not writeable, so we have a premature failure with EPERM and server never sees
this request which breaks things.

(the read-only mount is not handled as well at the moment of course and my patch
does not address this issue either, but it's easier to address in the VFS, like
in filename_create() or something).

I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check
are creates and deletes and with deletes we had a lookup already, so it already
looked up the child and revalidated the parent.
For creates, a revalidation still might be needed, I guess and that was the main driver
behind this check? And that only when you do current dir creates, because otherwise
the parent would have been revalidated in lookup?
Is this the major case why that check is actually there?

Just trying to see how to approach this better without breaking the applications.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust July 7, 2016, 4:59 p.m. UTC | #3
> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote:
> 
> 
> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote:
> 
>> 
>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote:
>>> 
>>> It's great when we can shave an extra RPC, but not at the expense
>>> of correctness.
>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the
>>> name already exists, even if we have no write access in parent.
>>> 
>>> Since the check in nfs_permission is clearly not enough to stave
>>> off this, just throw in the extra READ access to actually
>>> go through.
>>> 
>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>>> ---
>>> fs/nfs/dir.c | 4 +++-
>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>> 
>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>> index d8015a0..8c7835b 100644
>>> --- a/fs/nfs/dir.c
>>> +++ b/fs/nfs/dir.c
>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
>>> 	/*
>>> 	 * If we're doing an exclusive create, optimize away the lookup
>>> 	 * but don't hash the dentry.
>>> +	 * This optimization only works if we can write in the parent.
>>> 	 */
>>> -	if (nfs_is_exclusive_create(dir, flags))
>>> +	if (nfs_is_exclusive_create(dir, flags) &&
>>> +	    (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0))
>>> 		return NULL;
>>> 
>> 
>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server.
> 
> Right. This was mostly a discussion piece.
> The problem here is nfs_permission() returns 0 if you check for
> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then
> some other checks in the kernel still catch on to the fact that the directory
> is not writeable, so we have a premature failure with EPERM and server never sees
> this request which breaks things.

Are these VFS level checks? Which ones?

> 
> (the read-only mount is not handled as well at the moment of course and my patch
> does not address this issue either, but it's easier to address in the VFS, like
> in filename_create() or something).
> 
> I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check
> are creates and deletes and with deletes we had a lookup already, so it already
> looked up the child and revalidated the parent.
> For creates, a revalidation still might be needed, I guess and that was the main driver
> behind this check? And that only when you do current dir creates, because otherwise
> the parent would have been revalidated in lookup?
> Is this the major case why that check is actually there?
> 
> Just trying to see how to approach this better without breaking the applications.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oleg Drokin July 7, 2016, 5:07 p.m. UTC | #4
On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote:

> 
>> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote:
>> 
>> 
>> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote:
>> 
>>> 
>>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote:
>>>> 
>>>> It's great when we can shave an extra RPC, but not at the expense
>>>> of correctness.
>>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the
>>>> name already exists, even if we have no write access in parent.
>>>> 
>>>> Since the check in nfs_permission is clearly not enough to stave
>>>> off this, just throw in the extra READ access to actually
>>>> go through.
>>>> 
>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>>>> ---
>>>> fs/nfs/dir.c | 4 +++-
>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>> index d8015a0..8c7835b 100644
>>>> --- a/fs/nfs/dir.c
>>>> +++ b/fs/nfs/dir.c
>>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
>>>> 	/*
>>>> 	 * If we're doing an exclusive create, optimize away the lookup
>>>> 	 * but don't hash the dentry.
>>>> +	 * This optimization only works if we can write in the parent.
>>>> 	 */
>>>> -	if (nfs_is_exclusive_create(dir, flags))
>>>> +	if (nfs_is_exclusive_create(dir, flags) &&
>>>> +	    (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0))
>>>> 		return NULL;
>>>> 
>>> 
>>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server.
>> 
>> Right. This was mostly a discussion piece.
>> The problem here is nfs_permission() returns 0 if you check for
>> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then
>> some other checks in the kernel still catch on to the fact that the directory
>> is not writeable, so we have a premature failure with EPERM and server never sees
>> this request which breaks things.
> 
> Are these VFS level checks? Which ones?

Yes, VFS level I believe.
For Lustre it's may_create() from vfs_mkdir() that stops us short
and the Lustre patch is effective.
but for NFS this must be something else and I did not trace
it completely. One of the security checks, I guess?
if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC)
as in Lustre, that returns 0 no matter what.

This is trivial to reproduce too.

>> (the read-only mount is not handled as well at the moment of course and my patch
>> does not address this issue either, but it's easier to address in the VFS, like
>> in filename_create() or something).
>> 
>> I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check
>> are creates and deletes and with deletes we had a lookup already, so it already
>> looked up the child and revalidated the parent.
>> For creates, a revalidation still might be needed, I guess and that was the main driver
>> behind this check? And that only when you do current dir creates, because otherwise
>> the parent would have been revalidated in lookup?
>> Is this the major case why that check is actually there?
>> 
>> Just trying to see how to approach this better without breaking the applications.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust July 7, 2016, 5:27 p.m. UTC | #5
> On Jul 7, 2016, at 13:07, Oleg Drokin <green@linuxhacker.ru> wrote:
> 
> 
> On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote:
> 
>> 
>>> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote:
>>> 
>>> 
>>> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote:
>>> 
>>>> 
>>>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote:
>>>>> 
>>>>> It's great when we can shave an extra RPC, but not at the expense
>>>>> of correctness.
>>>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the
>>>>> name already exists, even if we have no write access in parent.
>>>>> 
>>>>> Since the check in nfs_permission is clearly not enough to stave
>>>>> off this, just throw in the extra READ access to actually
>>>>> go through.
>>>>> 
>>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>>>>> ---
>>>>> fs/nfs/dir.c | 4 +++-
>>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>> 
>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>>> index d8015a0..8c7835b 100644
>>>>> --- a/fs/nfs/dir.c
>>>>> +++ b/fs/nfs/dir.c
>>>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
>>>>> 	/*
>>>>> 	 * If we're doing an exclusive create, optimize away the lookup
>>>>> 	 * but don't hash the dentry.
>>>>> +	 * This optimization only works if we can write in the parent.
>>>>> 	 */
>>>>> -	if (nfs_is_exclusive_create(dir, flags))
>>>>> +	if (nfs_is_exclusive_create(dir, flags) &&
>>>>> +	    (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0))
>>>>> 		return NULL;
>>>>> 
>>>> 
>>>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server.
>>> 
>>> Right. This was mostly a discussion piece.
>>> The problem here is nfs_permission() returns 0 if you check for
>>> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then
>>> some other checks in the kernel still catch on to the fact that the directory
>>> is not writeable, so we have a premature failure with EPERM and server never sees
>>> this request which breaks things.
>> 
>> Are these VFS level checks? Which ones?
> 
> Yes, VFS level I believe.
> For Lustre it's may_create() from vfs_mkdir() that stops us short
> and the Lustre patch is effective.
> but for NFS this must be something else and I did not trace
> it completely. One of the security checks, I guess?
> if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC)
> as in Lustre, that returns 0 no matter what.
> 
> This is trivial to reproduce too.

So, should we be ignoring the MAY_EXEC flag when the VFS asks for inode_permission(dir, MAY_WRITE|MAY_EXEC)? I suspect that is the problem here.

> 
>>> (the read-only mount is not handled as well at the moment of course and my patch
>>> does not address this issue either, but it's easier to address in the VFS, like
>>> in filename_create() or something).
>>> 
>>> I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check
>>> are creates and deletes and with deletes we had a lookup already, so it already
>>> looked up the child and revalidated the parent.
>>> For creates, a revalidation still might be needed, I guess and that was the main driver
>>> behind this check? And that only when you do current dir creates, because otherwise
>>> the parent would have been revalidated in lookup?
>>> Is this the major case why that check is actually there?
>>> 
>>> Just trying to see how to approach this better without breaking the applications.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oleg Drokin July 7, 2016, 5:32 p.m. UTC | #6
On Jul 7, 2016, at 1:27 PM, Trond Myklebust wrote:

> 
>> On Jul 7, 2016, at 13:07, Oleg Drokin <green@linuxhacker.ru> wrote:
>> 
>> 
>> On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote:
>> 
>>> 
>>>> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote:
>>>> 
>>>> 
>>>> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote:
>>>> 
>>>>> 
>>>>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote:
>>>>>> 
>>>>>> It's great when we can shave an extra RPC, but not at the expense
>>>>>> of correctness.
>>>>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the
>>>>>> name already exists, even if we have no write access in parent.
>>>>>> 
>>>>>> Since the check in nfs_permission is clearly not enough to stave
>>>>>> off this, just throw in the extra READ access to actually
>>>>>> go through.
>>>>>> 
>>>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>>>>>> ---
>>>>>> fs/nfs/dir.c | 4 +++-
>>>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>> 
>>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>>>> index d8015a0..8c7835b 100644
>>>>>> --- a/fs/nfs/dir.c
>>>>>> +++ b/fs/nfs/dir.c
>>>>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
>>>>>> 	/*
>>>>>> 	 * If we're doing an exclusive create, optimize away the lookup
>>>>>> 	 * but don't hash the dentry.
>>>>>> +	 * This optimization only works if we can write in the parent.
>>>>>> 	 */
>>>>>> -	if (nfs_is_exclusive_create(dir, flags))
>>>>>> +	if (nfs_is_exclusive_create(dir, flags) &&
>>>>>> +	    (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0))
>>>>>> 		return NULL;
>>>>>> 
>>>>> 
>>>>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server.
>>>> 
>>>> Right. This was mostly a discussion piece.
>>>> The problem here is nfs_permission() returns 0 if you check for
>>>> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then
>>>> some other checks in the kernel still catch on to the fact that the directory
>>>> is not writeable, so we have a premature failure with EPERM and server never sees
>>>> this request which breaks things.
>>> 
>>> Are these VFS level checks? Which ones?
>> 
>> Yes, VFS level I believe.
>> For Lustre it's may_create() from vfs_mkdir() that stops us short
>> and the Lustre patch is effective.
>> but for NFS this must be something else and I did not trace
>> it completely. One of the security checks, I guess?
>> if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC)
>> as in Lustre, that returns 0 no matter what.
>> 
>> This is trivial to reproduce too.
> 
> So, should we be ignoring the MAY_EXEC flag when the VFS asks for inode_permission(dir, MAY_WRITE|MAY_EXEC)? I suspect that is the problem here.

I am not sure what do you mean here.

When vfs asks for inode_permission(dir, MAY_WRITE|MAY_EXEC) NFS already returns 0
(i.e. everything is ok, go ahead).
But it's all futile because some other check somewhere still catches on
to the fact that the directory is not writeable and does not let the mkdir
to actually happen in the FS itself.

>>>> (the read-only mount is not handled as well at the moment of course and my patch
>>>> does not address this issue either, but it's easier to address in the VFS, like
>>>> in filename_create() or something).
>>>> 
>>>> I see that two major consumers of this nfs_permission MAY_WRITE|!MAY_READ check
>>>> are creates and deletes and with deletes we had a lookup already, so it already
>>>> looked up the child and revalidated the parent.
>>>> For creates, a revalidation still might be needed, I guess and that was the main driver
>>>> behind this check? And that only when you do current dir creates, because otherwise
>>>> the parent would have been revalidated in lookup?
>>>> Is this the major case why that check is actually there?
>>>> 
>>>> Just trying to see how to approach this better without breaking the applications.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oleg Drokin July 7, 2016, 9:52 p.m. UTC | #7
On Jul 7, 2016, at 1:27 PM, Trond Myklebust wrote:

> 
>> On Jul 7, 2016, at 13:07, Oleg Drokin <green@linuxhacker.ru> wrote:
>> 
>> 
>> On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote:
>> 
>>> 
>>>> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote:
>>>> 
>>>> 
>>>> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote:
>>>> 
>>>>> 
>>>>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote:
>>>>>> 
>>>>>> It's great when we can shave an extra RPC, but not at the expense
>>>>>> of correctness.
>>>>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the
>>>>>> name already exists, even if we have no write access in parent.
>>>>>> 
>>>>>> Since the check in nfs_permission is clearly not enough to stave
>>>>>> off this, just throw in the extra READ access to actually
>>>>>> go through.
>>>>>> 
>>>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>>>>>> ---
>>>>>> fs/nfs/dir.c | 4 +++-
>>>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>> 
>>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>>>> index d8015a0..8c7835b 100644
>>>>>> --- a/fs/nfs/dir.c
>>>>>> +++ b/fs/nfs/dir.c
>>>>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
>>>>>> 	/*
>>>>>> 	 * If we're doing an exclusive create, optimize away the lookup
>>>>>> 	 * but don't hash the dentry.
>>>>>> +	 * This optimization only works if we can write in the parent.
>>>>>> 	 */
>>>>>> -	if (nfs_is_exclusive_create(dir, flags))
>>>>>> +	if (nfs_is_exclusive_create(dir, flags) &&
>>>>>> +	    (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0))
>>>>>> 		return NULL;
>>>>>> 
>>>>> 
>>>>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server.
>>>> 
>>>> Right. This was mostly a discussion piece.
>>>> The problem here is nfs_permission() returns 0 if you check for
>>>> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then
>>>> some other checks in the kernel still catch on to the fact that the directory
>>>> is not writeable, so we have a premature failure with EPERM and server never sees
>>>> this request which breaks things.
>>> 
>>> Are these VFS level checks? Which ones?
>> 
>> Yes, VFS level I believe.
>> For Lustre it's may_create() from vfs_mkdir() that stops us short
>> and the Lustre patch is effective.
>> but for NFS this must be something else and I did not trace
>> it completely. One of the security checks, I guess?
>> if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC)
>> as in Lustre, that returns 0 no matter what.
>> 
>> This is trivial to reproduce too.
> 
> So, should we be ignoring the MAY_EXEC flag when the VFS asks for inode_permission(dir, MAY_WRITE|MAY_EXEC)? I suspect that is the problem here.

You know, I have been wrong here.
VFS is not in the way.
it's the NFS server that's returning us the error:
[  326.069066] NFS: mkdir(0:44/12), test
[  326.070510] nfsd_dispatch: vers 3 proc 9
[  326.071612] nfsd: MKDIR(3)    16: 01010001 00000000 0000000c 4b3b7a92 00000000 00000000 test
[  326.071617] nfsd: fh_verify(16: 01010001 00000000 0000000c 4b3b7a92 00000000 00000000)
[  326.071684] fh_verify: /racer permission failure, acc=3, error=13

So I guess nfsd wants to be lazy too?
What side do you think this is best fixed then? the client or the server?--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust July 7, 2016, 11:17 p.m. UTC | #8
> On Jul 7, 2016, at 17:52, Oleg Drokin <green@linuxhacker.ru> wrote:
> 
> 
> On Jul 7, 2016, at 1:27 PM, Trond Myklebust wrote:
> 
>> 
>>> On Jul 7, 2016, at 13:07, Oleg Drokin <green@linuxhacker.ru> wrote:
>>> 
>>> 
>>> On Jul 7, 2016, at 12:59 PM, Trond Myklebust wrote:
>>> 
>>>> 
>>>>> On Jul 7, 2016, at 12:52, Oleg Drokin <green@linuxhacker.ru> wrote:
>>>>> 
>>>>> 
>>>>> On Jul 7, 2016, at 12:16 PM, Trond Myklebust wrote:
>>>>> 
>>>>>> 
>>>>>>> On Jul 7, 2016, at 01:53, Oleg Drokin <green@linuxhacker.ru> wrote:
>>>>>>> 
>>>>>>> It's great when we can shave an extra RPC, but not at the expense
>>>>>>> of correctness.
>>>>>>> We should not return EPERM (from vfs_create/mknod/mkdir) if the
>>>>>>> name already exists, even if we have no write access in parent.
>>>>>>> 
>>>>>>> Since the check in nfs_permission is clearly not enough to stave
>>>>>>> off this, just throw in the extra READ access to actually
>>>>>>> go through.
>>>>>>> 
>>>>>>> Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
>>>>>>> ---
>>>>>>> fs/nfs/dir.c | 4 +++-
>>>>>>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>>>>>> 
>>>>>>> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
>>>>>>> index d8015a0..8c7835b 100644
>>>>>>> --- a/fs/nfs/dir.c
>>>>>>> +++ b/fs/nfs/dir.c
>>>>>>> @@ -1383,8 +1383,10 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
>>>>>>> 	/*
>>>>>>> 	 * If we're doing an exclusive create, optimize away the lookup
>>>>>>> 	 * but don't hash the dentry.
>>>>>>> +	 * This optimization only works if we can write in the parent.
>>>>>>> 	 */
>>>>>>> -	if (nfs_is_exclusive_create(dir, flags))
>>>>>>> +	if (nfs_is_exclusive_create(dir, flags) &&
>>>>>>> +	    (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0))
>>>>>>> 		return NULL;
>>>>>>> 
>>>>>> 
>>>>>> NACK. The only write permission we should care about on the client side is whether or not the filesystem is mounted read-only. All other permissions are checked by the server.
>>>>> 
>>>>> Right. This was mostly a discussion piece.
>>>>> The problem here is nfs_permission() returns 0 if you check for
>>>>> inode_permission(dir, MAY_WRITE | MAY_EXEC) (as in may_create), but then
>>>>> some other checks in the kernel still catch on to the fact that the directory
>>>>> is not writeable, so we have a premature failure with EPERM and server never sees
>>>>> this request which breaks things.
>>>> 
>>>> Are these VFS level checks? Which ones?
>>> 
>>> Yes, VFS level I believe.
>>> For Lustre it's may_create() from vfs_mkdir() that stops us short
>>> and the Lustre patch is effective.
>>> but for NFS this must be something else and I did not trace
>>> it completely. One of the security checks, I guess?
>>> if NFS patch is changed to check inode_permission(dir, MAY_OPEN | MAY_EXEC)
>>> as in Lustre, that returns 0 no matter what.
>>> 
>>> This is trivial to reproduce too.
>> 
>> So, should we be ignoring the MAY_EXEC flag when the VFS asks for inode_permission(dir, MAY_WRITE|MAY_EXEC)? I suspect that is the problem here.
> 
> You know, I have been wrong here.
> VFS is not in the way.
> it's the NFS server that's returning us the error:
> [  326.069066] NFS: mkdir(0:44/12), test
> [  326.070510] nfsd_dispatch: vers 3 proc 9
> [  326.071612] nfsd: MKDIR(3)    16: 01010001 00000000 0000000c 4b3b7a92 00000000 00000000 test
> [  326.071617] nfsd: fh_verify(16: 01010001 00000000 0000000c 4b3b7a92 00000000 00000000)
> [  326.071684] fh_verify: /racer permission failure, acc=3, error=13
> 
> So I guess nfsd wants to be lazy too?
> What side do you think this is best fixed then? the client or the server?

If the bug lies in the knfsd server, then we should fix it there. The Linux client isn’t the only one out there, and those other clients should also be able to depend on correct semantics from the server.

Cheers
  Trond

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index d8015a0..8c7835b 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1383,8 +1383,10 @@  struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
 	/*
 	 * If we're doing an exclusive create, optimize away the lookup
 	 * but don't hash the dentry.
+	 * This optimization only works if we can write in the parent.
 	 */
-	if (nfs_is_exclusive_create(dir, flags))
+	if (nfs_is_exclusive_create(dir, flags) &&
+	    (inode_permission(dir, MAY_WRITE | MAY_READ | MAY_EXEC) == 0))
 		return NULL;
 
 	res = ERR_PTR(-ENOMEM);