[RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
diff mbox series

Message ID 20190903115827.0A8A0A405B@b06wcsmtp001.portsmouth.uk.ibm.com
State New
Headers show
Series
  • [RFC] - vfs: Null pointer dereference issue with symlink create and read of symlink
Related show

Commit Message

Ritesh Harjani Sept. 3, 2019, 11:58 a.m. UTC
Hi Viro/All,

Could you please review below issue and it's proposed solutions.
If you could let me know which of the two you think will be a better 
approach to solve this or in case if you have any other better approach, 
I can prepare and submit a official patch with that.



Issue signature:-
  [NIP  : trailing_symlink+80]
  [LR   : trailing_symlink+1092]
  #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  (unreliable)
  #5 [c00000198069bc00] path_openat at c0000000004bdd14
  #6 [c00000198069bc90] do_filp_open at c0000000004c0274
  #7 [c00000198069bdb0] do_sys_open at c00000000049b248
  #8 [c00000198069be30] system_call at c00000000000b388



Test case:-
shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1 
/gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done



Problem description:-
In some filesystems like GPFS below described scenario may happen on 
some platforms (Reported-By:- wugyuan)

Here, two threads are being run in 2 different shells. Thread-1(cat) 
does cat of the symlink and Thread-2(ln) is creating the symlink.

Now on any platform with GPFS like filesystem, if CPU does out-of-order 
execution (or any kind of re-ordering due compiler optimization?) in 
function __d_set_and_inode_type(), then we see a NULL pointer 
dereference due to inode->i_uid.

This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. 
in else condition), we check d_is_negative() without any lock protection.
And since in __d_set_and_inode_type() re-ordering may happen in setting 
of dentry->type & dentry->inode => this means that there is this tiny 
window where things are going wrong.


(GPFS like):- Any FS with -inode_operations ->permission callback 
returning -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this 
problem to happen. (few e.g. found were - ocfs2, ceph, coda, afs)

int xxx_permission(struct inode *inode, int mask)
{
          if (mask & MAY_NOT_BLOCK)
                  return -ECHILD;
	<...>
}

Wugyuan(cc), could reproduce this problem with GPFS filesystem.
Since, I didn't have the GPFS setup, so I tried replicating on a native 
FS by forcing out-of-order execution in function 
__d_set_inode_and_type() and making sure we return -ECHILD in 
MAY_NOT_BLOCK case in ->permission operation for all inodes.

With above changes in kernel, I could as well hit this issue on a native 
FS too.

(basically what we observed is link_path_walk will do nonRCU(REF-walk) 
lookup due to may_lookup -> inode_permission return -ECHILD and then 
unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race 
is possible).



Sequence of events:-

Thread-2(Comm: ln)		Thread-1(Comm: cat)

				dentry = __d_lookup() //nonRCU

__d_set_and_inode_type() (Out-of-order execution)
	flags = READ_ONCE(dentry->d_flags);
	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
	flags |= type_flags;
	WRITE_ONCE(dentry->d_flags, flags);

					
				if (unlikely(d_is_negative()) // fails
   					{}
				// since type is already updated in
				// Thread-2 in parallel but inode
				// not yet set.
				// d_is_negative returns false

				*inode = d_backing_inode(path->dentry);
				// means inode is still NULL

	dentry->d_inode = inode;
	
				trailing_symlink()
					may_follow_link()
						inode = nd->link_inode;
						// nd->link_inode = NULL
						//Then it crashes while
						//doing inode->i_uid
					
	



Approach-1:- using wmb()



Regards
Ritesh

Comments

Gao Xiang Sept. 3, 2019, 12:59 p.m. UTC | #1
On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote:
> Hi Viro/All,
> 
> Could you please review below issue and it's proposed solutions.
> If you could let me know which of the two you think will be a better
> approach to solve this or in case if you have any other better approach, I
> can prepare and submit a official patch with that.
> 
> 
> 
> Issue signature:-
>  [NIP  : trailing_symlink+80]
>  [LR   : trailing_symlink+1092]
>  #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  (unreliable)
>  #5 [c00000198069bc00] path_openat at c0000000004bdd14
>  #6 [c00000198069bc90] do_filp_open at c0000000004c0274
>  #7 [c00000198069bdb0] do_sys_open at c00000000049b248
>  #8 [c00000198069be30] system_call at c00000000000b388
> 
> 
> 
> Test case:-
> shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
> shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
> /gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
> 
> 
> 
> Problem description:-
> In some filesystems like GPFS below described scenario may happen on some
> platforms (Reported-By:- wugyuan)
> 
> Here, two threads are being run in 2 different shells. Thread-1(cat) does
> cat of the symlink and Thread-2(ln) is creating the symlink.
> 
> Now on any platform with GPFS like filesystem, if CPU does out-of-order
> execution (or any kind of re-ordering due compiler optimization?) in
> function __d_set_and_inode_type(), then we see a NULL pointer dereference
> due to inode->i_uid.
> 
> This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in
> else condition), we check d_is_negative() without any lock protection.
> And since in __d_set_and_inode_type() re-ordering may happen in setting of
> dentry->type & dentry->inode => this means that there is this tiny window
> where things are going wrong.
> 
> 
> (GPFS like):- Any FS with -inode_operations ->permission callback returning
> -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen.
> (few e.g. found were - ocfs2, ceph, coda, afs)
> 
> int xxx_permission(struct inode *inode, int mask)
> {
>          if (mask & MAY_NOT_BLOCK)
>                  return -ECHILD;
> 	<...>
> }
> 
> Wugyuan(cc), could reproduce this problem with GPFS filesystem.
> Since, I didn't have the GPFS setup, so I tried replicating on a native FS
> by forcing out-of-order execution in function __d_set_inode_and_type() and
> making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission
> operation for all inodes.
> 
> With above changes in kernel, I could as well hit this issue on a native FS
> too.
> 
> (basically what we observed is link_path_walk will do nonRCU(REF-walk)
> lookup due to may_lookup -> inode_permission return -ECHILD and then
> unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is
> possible).
> 
> 
> 
> Sequence of events:-
> 
> Thread-2(Comm: ln)		Thread-1(Comm: cat)
> 
> 				dentry = __d_lookup() //nonRCU
> 
> __d_set_and_inode_type() (Out-of-order execution)
> 	flags = READ_ONCE(dentry->d_flags);
> 	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
> 	flags |= type_flags;
> 	WRITE_ONCE(dentry->d_flags, flags);
> 
> 					
> 				if (unlikely(d_is_negative()) // fails
>   					{}
> 				// since type is already updated in
> 				// Thread-2 in parallel but inode
> 				// not yet set.
> 				// d_is_negative returns false
> 
> 				*inode = d_backing_inode(path->dentry);
> 				// means inode is still NULL
> 
> 	dentry->d_inode = inode;
> 	
> 				trailing_symlink()
> 					may_follow_link()
> 						inode = nd->link_inode;
> 						// nd->link_inode = NULL
> 						//Then it crashes while
> 						//doing inode->i_uid
> 					
> 	

It seems much similar to
https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/

Thanks,
Gao Xiang

> 
> 
> 
> Approach-1:- using wmb()
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index e88cf0554e65..966172df77ee 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -316,6 +316,7 @@ static inline void __d_set_inode_and_type(struct dentry
> *dentry,
>         unsigned flags;
> 
>         dentry->d_inode = inode;
> +       smp_wmb();
>         flags = READ_ONCE(dentry->d_flags);
>         flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
>         flags |= type_flags;
> 
> 
> 
> Approach-2:- using spin_lock(&dentry->d_lock);
> 
> Do you think lock should be a better approach, given that we are already
> in REF-walk mode. As per the Documentation, we should be able to take
> spin_lock(&dentry->d_lock) in Ref-walk mode whenever required?
> 
> 
> With smp_wmb(), if added, should add a small latency in both
> RCU/REF-walk. But should be able to cover all the cases in general related
> to dentry->type & dentry->inode ordering. Though, we don't have any other
> race conditions reported or tested, other than the one, mentioned in this
> email.
> 
> Confused :(
> 
> 
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 209c51a5226c..a3145594da1c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1557,6 +1557,7 @@ static int lookup_fast(struct nameidata *nd,
>         struct dentry *dentry, *parent = nd->path.dentry;
>         int status = 1;
>         int err;
> +       bool negative;
> 
>         /*
>          * Rename seqlock is not required here because in the off chance
> @@ -1565,7 +1566,6 @@ static int lookup_fast(struct nameidata *nd,
>          */
>         if (nd->flags & LOOKUP_RCU) {
>                 unsigned seq;
> -               bool negative;
>                 dentry = __d_lookup_rcu(parent, &nd->last, &seq);
>                 if (unlikely(!dentry)) {
>                         if (unlazy_walk(nd))
> @@ -1623,7 +1623,11 @@ static int lookup_fast(struct nameidata *nd,
>                 dput(dentry);
>                 return status;
>         }
> -       if (unlikely(d_is_negative(dentry))) {
> +
> +       spin_lock(&dentry->d_lock);
> +       negative = d_is_negative(dentry);
> +       spin_unlock(&dentry->d_lock);
> +       if (unlikely(negative)) {
>                 dput(dentry);
>                 return -ENOENT;
>         }
> 
> 
> Regards
> Ritesh
>
Ritesh Harjani Sept. 3, 2019, 1:41 p.m. UTC | #2
On 9/3/19 6:29 PM, Gao Xiang wrote:
> On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote:
>> Hi Viro/All,
>>
>> Could you please review below issue and it's proposed solutions.
>> If you could let me know which of the two you think will be a better
>> approach to solve this or in case if you have any other better approach, I
>> can prepare and submit a official patch with that.
>>
>>
>>
>> Issue signature:-
>>   [NIP  : trailing_symlink+80]
>>   [LR   : trailing_symlink+1092]
>>   #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  (unreliable)
>>   #5 [c00000198069bc00] path_openat at c0000000004bdd14
>>   #6 [c00000198069bc90] do_filp_open at c0000000004c0274
>>   #7 [c00000198069bdb0] do_sys_open at c00000000049b248
>>   #8 [c00000198069be30] system_call at c00000000000b388
>>
>>
>>
>> Test case:-
>> shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
>> shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
>> /gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
>>
>>
>>
>> Problem description:-
>> In some filesystems like GPFS below described scenario may happen on some
>> platforms (Reported-By:- wugyuan)
>>
>> Here, two threads are being run in 2 different shells. Thread-1(cat) does
>> cat of the symlink and Thread-2(ln) is creating the symlink.
>>
>> Now on any platform with GPFS like filesystem, if CPU does out-of-order
>> execution (or any kind of re-ordering due compiler optimization?) in
>> function __d_set_and_inode_type(), then we see a NULL pointer dereference
>> due to inode->i_uid.
>>
>> This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in
>> else condition), we check d_is_negative() without any lock protection.
>> And since in __d_set_and_inode_type() re-ordering may happen in setting of
>> dentry->type & dentry->inode => this means that there is this tiny window
>> where things are going wrong.
>>
>>
>> (GPFS like):- Any FS with -inode_operations ->permission callback returning
>> -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen.
>> (few e.g. found were - ocfs2, ceph, coda, afs)
>>
>> int xxx_permission(struct inode *inode, int mask)
>> {
>>           if (mask & MAY_NOT_BLOCK)
>>                   return -ECHILD;
>> 	<...>
>> }
>>
>> Wugyuan(cc), could reproduce this problem with GPFS filesystem.
>> Since, I didn't have the GPFS setup, so I tried replicating on a native FS
>> by forcing out-of-order execution in function __d_set_inode_and_type() and
>> making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission
>> operation for all inodes.
>>
>> With above changes in kernel, I could as well hit this issue on a native FS
>> too.
>>
>> (basically what we observed is link_path_walk will do nonRCU(REF-walk)
>> lookup due to may_lookup -> inode_permission return -ECHILD and then
>> unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is
>> possible).
>>
>>
>>
>> Sequence of events:-
>>
>> Thread-2(Comm: ln)		Thread-1(Comm: cat)
>>
>> 				dentry = __d_lookup() //nonRCU
>>
>> __d_set_and_inode_type() (Out-of-order execution)
>> 	flags = READ_ONCE(dentry->d_flags);
>> 	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
>> 	flags |= type_flags;
>> 	WRITE_ONCE(dentry->d_flags, flags);
>>
>> 					
>> 				if (unlikely(d_is_negative()) // fails
>>    					{}
>> 				// since type is already updated in
>> 				// Thread-2 in parallel but inode
>> 				// not yet set.
>> 				// d_is_negative returns false
>>
>> 				*inode = d_backing_inode(path->dentry);
>> 				// means inode is still NULL
>>
>> 	dentry->d_inode = inode;
>> 	
>> 				trailing_symlink()
>> 					may_follow_link()
>> 						inode = nd->link_inode;
>> 						// nd->link_inode = NULL
>> 						//Then it crashes while
>> 						//doing inode->i_uid
>> 					
>> 	
> 
> It seems much similar to
> https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/

Thanks, yes two same symptoms with different use cases.
But except the fact that here, we see the issue with GPFS quite 
frequently. So let's hope that we could have some solution to this 
problem in upstream.

 From the thread:-
 >> We could simply use d_really_is_negative() there, avoiding all that
 >> mess.  If and when we get around to whiteouts-in-dcache (i.e. if
 >> unionfs series gets resurrected), we can revisit that

I didn't get this part. Does it mean, d_really_is_negative can only be 
used, once whiteouts-in-dcache series is resurrected?
If yes, meanwhile could we have any other solution in place?

-ritesh
Gao Xiang Sept. 3, 2019, 1:58 p.m. UTC | #3
On Tue, Sep 03, 2019 at 07:11:28PM +0530, Ritesh Harjani wrote:
> 
> 
> On 9/3/19 6:29 PM, Gao Xiang wrote:
> > On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote:
> > > Hi Viro/All,
> > > 
> > > Could you please review below issue and it's proposed solutions.
> > > If you could let me know which of the two you think will be a better
> > > approach to solve this or in case if you have any other better approach, I
> > > can prepare and submit a official patch with that.
> > > 
> > > 
> > > 
> > > Issue signature:-
> > >   [NIP  : trailing_symlink+80]
> > >   [LR   : trailing_symlink+1092]
> > >   #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  (unreliable)
> > >   #5 [c00000198069bc00] path_openat at c0000000004bdd14
> > >   #6 [c00000198069bc90] do_filp_open at c0000000004c0274
> > >   #7 [c00000198069bdb0] do_sys_open at c00000000049b248
> > >   #8 [c00000198069be30] system_call at c00000000000b388
> > > 
> > > 
> > > 
> > > Test case:-
> > > shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
> > > shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
> > > /gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
> > > 
> > > 
> > > 
> > > Problem description:-
> > > In some filesystems like GPFS below described scenario may happen on some
> > > platforms (Reported-By:- wugyuan)
> > > 
> > > Here, two threads are being run in 2 different shells. Thread-1(cat) does
> > > cat of the symlink and Thread-2(ln) is creating the symlink.
> > > 
> > > Now on any platform with GPFS like filesystem, if CPU does out-of-order
> > > execution (or any kind of re-ordering due compiler optimization?) in
> > > function __d_set_and_inode_type(), then we see a NULL pointer dereference
> > > due to inode->i_uid.
> > > 
> > > This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in
> > > else condition), we check d_is_negative() without any lock protection.
> > > And since in __d_set_and_inode_type() re-ordering may happen in setting of
> > > dentry->type & dentry->inode => this means that there is this tiny window
> > > where things are going wrong.
> > > 
> > > 
> > > (GPFS like):- Any FS with -inode_operations ->permission callback returning
> > > -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen.
> > > (few e.g. found were - ocfs2, ceph, coda, afs)
> > > 
> > > int xxx_permission(struct inode *inode, int mask)
> > > {
> > >           if (mask & MAY_NOT_BLOCK)
> > >                   return -ECHILD;
> > > 	<...>
> > > }
> > > 
> > > Wugyuan(cc), could reproduce this problem with GPFS filesystem.
> > > Since, I didn't have the GPFS setup, so I tried replicating on a native FS
> > > by forcing out-of-order execution in function __d_set_inode_and_type() and
> > > making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission
> > > operation for all inodes.
> > > 
> > > With above changes in kernel, I could as well hit this issue on a native FS
> > > too.
> > > 
> > > (basically what we observed is link_path_walk will do nonRCU(REF-walk)
> > > lookup due to may_lookup -> inode_permission return -ECHILD and then
> > > unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is
> > > possible).
> > > 
> > > 
> > > 
> > > Sequence of events:-
> > > 
> > > Thread-2(Comm: ln)		Thread-1(Comm: cat)
> > > 
> > > 				dentry = __d_lookup() //nonRCU
> > > 
> > > __d_set_and_inode_type() (Out-of-order execution)
> > > 	flags = READ_ONCE(dentry->d_flags);
> > > 	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
> > > 	flags |= type_flags;
> > > 	WRITE_ONCE(dentry->d_flags, flags);
> > > 
> > > 					
> > > 				if (unlikely(d_is_negative()) // fails
> > >    					{}
> > > 				// since type is already updated in
> > > 				// Thread-2 in parallel but inode
> > > 				// not yet set.
> > > 				// d_is_negative returns false
> > > 
> > > 				*inode = d_backing_inode(path->dentry);
> > > 				// means inode is still NULL
> > > 
> > > 	dentry->d_inode = inode;
> > > 	
> > > 				trailing_symlink()
> > > 					may_follow_link()
> > > 						inode = nd->link_inode;
> > > 						// nd->link_inode = NULL
> > > 						//Then it crashes while
> > > 						//doing inode->i_uid
> > > 					
> > > 	
> > 
> > It seems much similar to
> > https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/
> 
> Thanks, yes two same symptoms with different use cases.
> But except the fact that here, we see the issue with GPFS quite frequently.
> So let's hope that we could have some solution to this problem in upstream.
> 
> From the thread:-
> >> We could simply use d_really_is_negative() there, avoiding all that
> >> mess.  If and when we get around to whiteouts-in-dcache (i.e. if
> >> unionfs series gets resurrected), we can revisit that
> 
> I didn't get this part. Does it mean, d_really_is_negative can only be used,
> once whiteouts-in-dcache series is resurrected?
> If yes, meanwhile could we have any other solution in place?

In my own premature opinion, I think it's some complicated about
the coexistence of d_is_negative() and d_really_is_negative(),
and handle both d_flags and d_inode stuffs for negative dentries...

No constructive idea here... Just same case found by our colleagues...

Thanks,
Gao Xiang


> 
> -ritesh
>
Jeff Layton Sept. 4, 2019, 2:39 p.m. UTC | #4
On Tue, 2019-09-03 at 19:11 +0530, Ritesh Harjani wrote:
> 
> On 9/3/19 6:29 PM, Gao Xiang wrote:
> > On Tue, Sep 03, 2019 at 05:28:26PM +0530, Ritesh Harjani wrote:
> > > Hi Viro/All,
> > > 
> > > Could you please review below issue and it's proposed solutions.
> > > If you could let me know which of the two you think will be a better
> > > approach to solve this or in case if you have any other better approach, I
> > > can prepare and submit a official patch with that.
> > > 
> > > 
> > > 
> > > Issue signature:-
> > >   [NIP  : trailing_symlink+80]
> > >   [LR   : trailing_symlink+1092]
> > >   #4 [c00000198069bb70] trailing_symlink at c0000000004bae60  (unreliable)
> > >   #5 [c00000198069bc00] path_openat at c0000000004bdd14
> > >   #6 [c00000198069bc90] do_filp_open at c0000000004c0274
> > >   #7 [c00000198069bdb0] do_sys_open at c00000000049b248
> > >   #8 [c00000198069be30] system_call at c00000000000b388
> > > 
> > > 
> > > 
> > > Test case:-
> > > shell-1 - "while [ 1 ]; do cat /gpfs/g1/testdir/file3; sleep 1; done"
> > > shell-2 - "while [ 1 ]; do ln -s /gpfs/g1/testdir/file1
> > > /gpfs/g1/testdir/file3; sleep 1; rm /gpfs/g1/testdir/file3 sleep 1; done
> > > 
> > > 
> > > 
> > > Problem description:-
> > > In some filesystems like GPFS below described scenario may happen on some
> > > platforms (Reported-By:- wugyuan)
> > > 
> > > Here, two threads are being run in 2 different shells. Thread-1(cat) does
> > > cat of the symlink and Thread-2(ln) is creating the symlink.
> > > 
> > > Now on any platform with GPFS like filesystem, if CPU does out-of-order
> > > execution (or any kind of re-ordering due compiler optimization?) in
> > > function __d_set_and_inode_type(), then we see a NULL pointer dereference
> > > due to inode->i_uid.
> > > 
> > > This happens because in lookup_fast in nonRCU path or say REF-walk (i.e. in
> > > else condition), we check d_is_negative() without any lock protection.
> > > And since in __d_set_and_inode_type() re-ordering may happen in setting of
> > > dentry->type & dentry->inode => this means that there is this tiny window
> > > where things are going wrong.
> > > 
> > > 
> > > (GPFS like):- Any FS with -inode_operations ->permission callback returning
> > > -ECHILD in case of (mask & MAY_NOT_BLOCK) may cause this problem to happen.
> > > (few e.g. found were - ocfs2, ceph, coda, afs)
> > > 
> > > int xxx_permission(struct inode *inode, int mask)
> > > {
> > >           if (mask & MAY_NOT_BLOCK)
> > >                   return -ECHILD;
> > > 	<...>
> > > }
> > > 
> > > Wugyuan(cc), could reproduce this problem with GPFS filesystem.
> > > Since, I didn't have the GPFS setup, so I tried replicating on a native FS
> > > by forcing out-of-order execution in function __d_set_inode_and_type() and
> > > making sure we return -ECHILD in MAY_NOT_BLOCK case in ->permission
> > > operation for all inodes.
> > > 
> > > With above changes in kernel, I could as well hit this issue on a native FS
> > > too.
> > > 
> > > (basically what we observed is link_path_walk will do nonRCU(REF-walk)
> > > lookup due to may_lookup -> inode_permission return -ECHILD and then
> > > unlazy_walk drops the LOOKUP_RCU flag (nd->flag). After that below race is
> > > possible).
> > > 
> > > 
> > > 
> > > Sequence of events:-
> > > 
> > > Thread-2(Comm: ln)		Thread-1(Comm: cat)
> > > 
> > > 				dentry = __d_lookup() //nonRCU
> > > 
> > > __d_set_and_inode_type() (Out-of-order execution)
> > > 	flags = READ_ONCE(dentry->d_flags);
> > > 	flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
> > > 	flags |= type_flags;
> > > 	WRITE_ONCE(dentry->d_flags, flags);
> > > 
> > > 					
> > > 				if (unlikely(d_is_negative()) // fails
> > >    					{}
> > > 				// since type is already updated in
> > > 				// Thread-2 in parallel but inode
> > > 				// not yet set.
> > > 				// d_is_negative returns false
> > > 
> > > 				*inode = d_backing_inode(path->dentry);
> > > 				// means inode is still NULL
> > > 
> > > 	dentry->d_inode = inode;
> > > 	
> > > 				trailing_symlink()
> > > 					may_follow_link()
> > > 						inode = nd->link_inode;
> > > 						// nd->link_inode = NULL
> > > 						//Then it crashes while
> > > 						//doing inode->i_uid
> > > 					
> > > 	
> > 
> > It seems much similar to
> > https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/
> 
> Thanks, yes two same symptoms with different use cases.
> But except the fact that here, we see the issue with GPFS quite 
> frequently. So let's hope that we could have some solution to this 
> problem in upstream.
> 

Agreed. Looks a lot like the same issue.

>  From the thread:-
>  >> We could simply use d_really_is_negative() there, avoiding all that
>  >> mess.  If and when we get around to whiteouts-in-dcache (i.e. if
>  >> unionfs series gets resurrected), we can revisit that
> 
> I didn't get this part. Does it mean, d_really_is_negative can only be 
> used, once whiteouts-in-dcache series is resurrected?
> If yes, meanwhile could we have any other solution in place?
> 

I think Al was saying that you could change this to use
d_really_is_negative now but the whiteouts-in-dcache series would have
to deal with that. That series seems to be stalled for the time being,
so I wouldn't let it stop you from changing this.

It wouldn't hurt to put in some comments with this change too, to make
sure everyone understands why that's being used there.
Ritesh Harjani Sept. 6, 2019, 5:17 a.m. UTC | #5
On 9/4/19 8:09 PM, Jeff Layton wrote:
>>> It seems much similar to
>>> https://lore.kernel.org/r/20190419084810.63732-1-houtao1@huawei.com/
>>
>> Thanks, yes two same symptoms with different use cases.
>> But except the fact that here, we see the issue with GPFS quite
>> frequently. So let's hope that we could have some solution to this
>> problem in upstream.
>>
> 
> Agreed. Looks a lot like the same issue.
> 
>>   From the thread:-
>>   >> We could simply use d_really_is_negative() there, avoiding all that
>>   >> mess.  If and when we get around to whiteouts-in-dcache (i.e. if
>>   >> unionfs series gets resurrected), we can revisit that
>>
>> I didn't get this part. Does it mean, d_really_is_negative can only be
>> used, once whiteouts-in-dcache series is resurrected?
>> If yes, meanwhile could we have any other solution in place?
>>
> 
> I think Al was saying that you could change this to use
> d_really_is_negative now but the whiteouts-in-dcache series would have
> to deal with that. That series seems to be stalled for the time being,
> so I wouldn't let it stop you from changing this.
> 
> It wouldn't hurt to put in some comments with this change too, to make
> sure everyone understands why that's being used there.

Thanks Jeff. I did look at the thread again and yes you are right.
I have created the patch and added some comments as per your suggestion. 
Once it is tested again internally (although it does looks like it 
should work fine), I will post the patch.

Appreciate your help!!

-ritesh

Patch
diff mbox series

diff --git a/fs/dcache.c b/fs/dcache.c
index e88cf0554e65..966172df77ee 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -316,6 +316,7 @@  static inline void __d_set_inode_and_type(struct 
dentry *dentry,
         unsigned flags;

         dentry->d_inode = inode;
+       smp_wmb();
         flags = READ_ONCE(dentry->d_flags);
         flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU);
         flags |= type_flags;



Approach-2:- using spin_lock(&dentry->d_lock);

Do you think lock should be a better approach, given that we are already
in REF-walk mode. As per the Documentation, we should be able to take
spin_lock(&dentry->d_lock) in Ref-walk mode whenever required?


With smp_wmb(), if added, should add a small latency in both
RCU/REF-walk. But should be able to cover all the cases in general 
related to dentry->type & dentry->inode ordering. Though, we don't have 
any other race conditions reported or tested, other than the one, 
mentioned in this email.

Confused :(



diff --git a/fs/namei.c b/fs/namei.c
index 209c51a5226c..a3145594da1c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1557,6 +1557,7 @@  static int lookup_fast(struct nameidata *nd,
         struct dentry *dentry, *parent = nd->path.dentry;
         int status = 1;
         int err;
+       bool negative;

         /*
          * Rename seqlock is not required here because in the off chance
@@ -1565,7 +1566,6 @@  static int lookup_fast(struct nameidata *nd,
          */
         if (nd->flags & LOOKUP_RCU) {
                 unsigned seq;
-               bool negative;
                 dentry = __d_lookup_rcu(parent, &nd->last, &seq);
                 if (unlikely(!dentry)) {
                         if (unlazy_walk(nd))
@@ -1623,7 +1623,11 @@  static int lookup_fast(struct nameidata *nd,
                 dput(dentry);
                 return status;
         }
-       if (unlikely(d_is_negative(dentry))) {
+
+       spin_lock(&dentry->d_lock);
+       negative = d_is_negative(dentry);
+       spin_unlock(&dentry->d_lock);
+       if (unlikely(negative)) {
                 dput(dentry);
                 return -ENOENT;
         }