Message ID | 20190419084810.63732-1-houtao1@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | dcache: ensure d_flags & d_inode are consistent in lookup_fast() | expand |
ping ? On 2019/4/19 16:48, Hou Tao wrote: > After extending the size of dentry from 192-bytes to 208-bytes > under aarch64, we got oops during the running of xfstests generic/429: > > Unable to handle kernel NULL pointer dereference at virtual address 0000000000000002 > CPU: 3 PID: 2725 Comm: t_encrypted_d_r Tainted: G D 5.1.0-rc4 > pc : inode_permission+0x28/0x160 > lr : link_path_walk.part.11+0x27c/0x528 > ...... > Call trace: > inode_permission+0x28/0x160 > link_path_walk.part.11+0x27c/0x528 > path_lookupat+0x64/0x208 > filename_lookup+0xa0/0x178 > user_path_at_empty+0x58/0x70 > vfs_statx+0x94/0x118 > __se_sys_newfstatat+0x58/0x98 > __arm64_sys_newfstatat+0x24/0x30 > el0_svc_common+0x7c/0x148 > el0_svc_handler+0x38/0x88 > el0_svc+0x8/0xc > > If we revert the size extension of dentry, the oops will be gone. > However if we just move the d_inode field from the begin of dentry > struct to the end of dentry struct (poorly simulate the way how > __randomize_layout works), the oops will reoccur. > > The following scenario illustrates the problem: > > precondition: > * dentry A has just been unlinked and becomes a negative dentry > * dentry A is encrypted, so it has d_revalidate hook: fscrypt_d_revalidate() > * lookup process is looking A/file, and creation process is creating A > > lookup process: creation process: > > lookup_fast > __d_lookup_rcu returns dentry A > > d_revalidate returns -ECHILD > > d_revalidate again succeed > __d_set_inode_and_type > dentry->d_inode = inode > WRITE_ONCE(dentry->d_flags, flags) > > d_is_negative(dentry) return false > follow_managed doesn't nothing > // inconsistent with d_flags > d_backing_inode() return NULL > nd->inode = NULL > > may_lookup() > // oops occurs > inode_permission(nd->inode > > The root cause is the inconsistency between d_flags & d_inode > during the REF-walk in lookup_fast(): d_is_negative(dentry) > returns false, but d_backing_inode() still returns a NULL pointer. > > The RCU-walk path in lookup_fast() uses d_seq to ensure d_flags & d_inode > are consistent, and lookup_slow() use inode lock to ensure that, so only > the REF-walk path in lookup_fast() is problematic. > > Fixing it by adding a paired smp_rmb/smp_wmb between the reading/writing > of d_inode & d_flags to ensure the consistency. > > Signed-off-by: Hou Tao <houtao1@huawei.com> > --- > fs/dcache.c | 2 ++ > fs/namei.c | 7 +++++++ > 2 files changed, 9 insertions(+) > > diff --git a/fs/dcache.c b/fs/dcache.c > index aac41adf4743..1eb85f9fcb0f 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -316,6 +316,8 @@ static inline void __d_set_inode_and_type(struct dentry *dentry, > unsigned flags; > > dentry->d_inode = inode; > + /* paired with smp_rmb() in lookup_fast() */ > + smp_wmb(); > flags = READ_ONCE(dentry->d_flags); > flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); > flags |= type_flags; > diff --git a/fs/namei.c b/fs/namei.c > index dede0147b3f6..833f760c70b2 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -1628,6 +1628,13 @@ static int lookup_fast(struct nameidata *nd, > return -ENOENT; > } > > + /* > + * Paired with smp_wmb() in __d_set_inode_and_type() to ensure > + * d_backing_inode is not NULL after the checking of d_flags > + * in d_is_negative() completes. > + */ > + smp_rmb(); > + > path->mnt = mnt; > path->dentry = dentry; > err = follow_managed(path, nd); >
On Fri, Apr 19, 2019 at 04:48:10PM +0800, Hou Tao wrote: > The root cause is the inconsistency between d_flags & d_inode > during the REF-walk in lookup_fast(): d_is_negative(dentry) > returns false, but d_backing_inode() still returns a NULL pointer. > > The RCU-walk path in lookup_fast() uses d_seq to ensure d_flags & d_inode > are consistent, and lookup_slow() use inode lock to ensure that, so only > the REF-walk path in lookup_fast() is problematic. > > Fixing it by adding a paired smp_rmb/smp_wmb between the reading/writing > of d_inode & d_flags to ensure the consistency. The problem is real, but I'm not sure I like the proposed fix ;-/ We could simply use d_really_is_negative() there, avoiding all that mess. If and when we get around to whiteouts-in-dcache (i.e. if unionfs series gets resurrected), we can revisit that...
diff --git a/fs/dcache.c b/fs/dcache.c index aac41adf4743..1eb85f9fcb0f 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -316,6 +316,8 @@ static inline void __d_set_inode_and_type(struct dentry *dentry, unsigned flags; dentry->d_inode = inode; + /* paired with smp_rmb() in lookup_fast() */ + smp_wmb(); flags = READ_ONCE(dentry->d_flags); flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); flags |= type_flags; diff --git a/fs/namei.c b/fs/namei.c index dede0147b3f6..833f760c70b2 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1628,6 +1628,13 @@ static int lookup_fast(struct nameidata *nd, return -ENOENT; } + /* + * Paired with smp_wmb() in __d_set_inode_and_type() to ensure + * d_backing_inode is not NULL after the checking of d_flags + * in d_is_negative() completes. + */ + smp_rmb(); + path->mnt = mnt; path->dentry = dentry; err = follow_managed(path, nd);
After extending the size of dentry from 192-bytes to 208-bytes under aarch64, we got oops during the running of xfstests generic/429: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000002 CPU: 3 PID: 2725 Comm: t_encrypted_d_r Tainted: G D 5.1.0-rc4 pc : inode_permission+0x28/0x160 lr : link_path_walk.part.11+0x27c/0x528 ...... Call trace: inode_permission+0x28/0x160 link_path_walk.part.11+0x27c/0x528 path_lookupat+0x64/0x208 filename_lookup+0xa0/0x178 user_path_at_empty+0x58/0x70 vfs_statx+0x94/0x118 __se_sys_newfstatat+0x58/0x98 __arm64_sys_newfstatat+0x24/0x30 el0_svc_common+0x7c/0x148 el0_svc_handler+0x38/0x88 el0_svc+0x8/0xc If we revert the size extension of dentry, the oops will be gone. However if we just move the d_inode field from the begin of dentry struct to the end of dentry struct (poorly simulate the way how __randomize_layout works), the oops will reoccur. The following scenario illustrates the problem: precondition: * dentry A has just been unlinked and becomes a negative dentry * dentry A is encrypted, so it has d_revalidate hook: fscrypt_d_revalidate() * lookup process is looking A/file, and creation process is creating A lookup process: creation process: lookup_fast __d_lookup_rcu returns dentry A d_revalidate returns -ECHILD d_revalidate again succeed __d_set_inode_and_type dentry->d_inode = inode WRITE_ONCE(dentry->d_flags, flags) d_is_negative(dentry) return false follow_managed doesn't nothing // inconsistent with d_flags d_backing_inode() return NULL nd->inode = NULL may_lookup() // oops occurs inode_permission(nd->inode The root cause is the inconsistency between d_flags & d_inode during the REF-walk in lookup_fast(): d_is_negative(dentry) returns false, but d_backing_inode() still returns a NULL pointer. The RCU-walk path in lookup_fast() uses d_seq to ensure d_flags & d_inode are consistent, and lookup_slow() use inode lock to ensure that, so only the REF-walk path in lookup_fast() is problematic. Fixing it by adding a paired smp_rmb/smp_wmb between the reading/writing of d_inode & d_flags to ensure the consistency. Signed-off-by: Hou Tao <houtao1@huawei.com> --- fs/dcache.c | 2 ++ fs/namei.c | 7 +++++++ 2 files changed, 9 insertions(+)