Message ID | 20180223201317.GG30522@ZenIV.linux.org.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Feb 23, 2018 at 12:13 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > Look: > dentry placed on a shrink list > we pick the fucker from the list and lock it. > we call lock_parent() on it. > dentry is not a root and it's not deleted, so we proceed. > trylock fails. > we grab rcu_read_lock() > we drop dentry->d_lock [ deleted the bad things ] Should we just instead get the ref to the dentry before dropping the lock, so that nobody else can get to dentry_kill? This is too subtle, and your fix to check d_lockref.count < 0 sounds wrong to me. If it's really gone, maybe it has been reused and the refcount is positive again, but it's something else than a dentry entirely? Hmm. No, you extended the rcu read section, so I guess your patch is fine. And lock_parent already has that pattern, soiit's not new. Ok, I agree, looks like lock_parent should just re-check that thing that it already checked earler, but that now might be true again because of we dropped d_lock. Linus
On Fri, Feb 23, 2018 at 01:35:52PM -0800, Linus Torvalds wrote: > This is too subtle, and your fix to check d_lockref.count < 0 sounds > wrong to me. If it's really gone, maybe it has been reused and the > refcount is positive again, but it's something else than a dentry > entirely? > > Hmm. > > No, you extended the rcu read section, so I guess your patch is fine. > And lock_parent already has that pattern, soiit's not new. > > Ok, I agree, looks like lock_parent should just re-check that thing > that it already checked earler, but that now might be true again > because of we dropped d_lock. IMO that's the right thing for backports; whether we keep it after the getting rid of trylock loops is a different question. Note that the only case where we do not have __dentry_kill() prevention guaranteed by the caller (either by holding a reference, or by holding onto ->i_lock all along) is in shrink_dentry_list(). And there we have more than enough of other subtle crap. Moreover, there we have a good reason to treat "it had been moved" as "kick it off the shrink list and free if it's already dead", which might simplify the things. Below is a stab at that: /* * ONLY for shrink_dentry_list() - it returns false if it finds * dentry grabbed, moved or killed, which is fine there but not * anywhere else. OTOH, nobody else needs to deal with dentries * getting killed under them. */ static bool shrink_lock_for_kill(struct dentry *dentry) { if (dentry->d_lockref.count) return false; inode = dentry->d_inode; if (inode && unlikely(!spin_trylock(&inode->i_lock))) { rcu_read_lock(); /* to protect inode */ spin_unlock(&dentry->d_lock); spin_lock(&inode->i_lock); spin_lock(&dentry->d_lock); if (unlikely(dentry->d_lockref.count)) goto out; /* changed inode means that somebody had grabbed it */ if (unlikely(inode != dentry->d_inode)) goto out; rcu_read_unlock(); } parent = dentry->d_parent; if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock))) return true; rcu_read_lock(); /* to protect parent */ spin_unlock(&dentry->d_lock); parent = READ_ONCE(dentry->d_parent); spin_lock(&parent->d_lock); if (unlikely(parent != dentry->d_parent)) { spin_unlock(&parent->d_lock); spin_lock(&dentry->d_lock); goto out; } spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED); if (likely(!dentry->d_lockref.count)) { rcu_read_unlock(); return true; } spin_unlock(&parent->d_lock); out: spin_unlock(&inode->i_lock); rcu_read_unlock(); return false; } static void shrink_dentry_list(struct list_head *list) { struct dentry *dentry, *parent; while (!list_empty(list)) { struct inode *inode; dentry = list_entry(list->prev, struct dentry, d_lru); spin_lock(&dentry->d_lock); if (!shrink_lock_for_kill(dentry)) { bool can_free = false; d_shrink_del(dentry); if (dentry->d_lockref.count < 0) can_free = dentry->d_flags & DCACHE_MAY_FREE; spin_unlock(&dentry->d_lock); if (can_free) dentry_free(dentry); continue; } d_shrink_del(dentry); parent = dentry->d_parent; __dentry_kill(dentry); if (dentry == parent) continue; dentry = parent; .... same as now .... } }
On Sat, Feb 24, 2018 at 12:22:48AM +0000, Al Viro wrote: > On Fri, Feb 23, 2018 at 01:35:52PM -0800, Linus Torvalds wrote: > > > This is too subtle, and your fix to check d_lockref.count < 0 sounds > > wrong to me. If it's really gone, maybe it has been reused and the > > refcount is positive again, but it's something else than a dentry > > entirely? > > > > Hmm. > > > > No, you extended the rcu read section, so I guess your patch is fine. > > And lock_parent already has that pattern, soiit's not new. > > > > Ok, I agree, looks like lock_parent should just re-check that thing > > that it already checked earler, but that now might be true again > > because of we dropped d_lock. > > IMO that's the right thing for backports; whether we keep it after > the getting rid of trylock loops is a different question. Note that > the only case where we do not have __dentry_kill() prevention > guaranteed by the caller (either by holding a reference, or by > holding onto ->i_lock all along) is in shrink_dentry_list(). > And there we have more than enough of other subtle crap. > > Moreover, there we have a good reason to treat "it had been moved" > as "kick it off the shrink list and free if it's already dead", > which might simplify the things. Below is a stab at that: FWIW, a variant of that series is in #work.dcache; it's almost certainly not the final (I want to clean the things up and probably reorder them as well) and it needs one hell of profiling and review. It seems to work, and I don't see any obvious performance regressions (everything seems to be within normal noise), but I'd like it beaten up a lot more. Parts are from John's series, parts are rewritten as discussed upthread. As the result, trylock loops are gone and no new retry loops had been added in their place - lock_parent() still has one, but that's it. I'll play with cleaning up and reordering tomorrow after I get some sleep. In the meanwhile, the current state of that set is at git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache
On 2018-02-25, Al Viro <viro@ZenIV.linux.org.uk> wrote: > I'll play with cleaning up and reordering tomorrow after I get some > sleep. In the meanwhile, the current state of that set is at > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache I have one comment on your new dentry_kill()... > From e77db95d1ad189c76bcda7b90ac2225e98d776a2 Mon Sep 17 00:00:00 2001 > From: Al Viro <viro@zeniv.linux.org.uk> > Date: Fri, 23 Feb 2018 21:25:42 -0500 > Subject: [PATCH] get rid of trylock loop around dentry_kill() > > In case when trylock in there fails, deal with it directly in > dentry_kill(). Note that in cases when we drop and retake > ->d_lock, we need to recheck whether to retain the dentry. > Another thing is that dropping/retaking ->d_lock might have > ended up with negative dentry turning into positive; that, > of course, can happen only once... > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/dcache.c | 34 +++++++++++++++++++++++++++------- > 1 file changed, 27 insertions(+), 7 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index 449c1a5..c1f082d 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -657,23 +657,43 @@ static struct dentry *dentry_kill(struct dentry *dentry) > struct dentry *parent = NULL; > > if (inode && unlikely(!spin_trylock(&inode->i_lock))) > - goto failed; > + goto slow_positive; > > if (!IS_ROOT(dentry)) { > parent = dentry->d_parent; > if (unlikely(!spin_trylock(&parent->d_lock))) { > - if (inode) > - spin_unlock(&inode->i_lock); > - goto failed; > + parent = __lock_parent(dentry); > + if (likely(inode || !dentry->d_inode)) > + goto got_locks; > + /* negative that became positive */ > + if (parent) > + spin_unlock(&parent->d_lock); > + inode = dentry->d_inode; > + goto slow_positive; > } > } > - > __dentry_kill(dentry); > return parent; > > -failed: > +slow_positive: > + spin_unlock(&dentry->d_lock); > + spin_lock(&inode->i_lock); > + spin_lock(&dentry->d_lock); > + parent = lock_parent(dentry); > +got_locks: > + if (likely(dentry->d_lockref.count == 1 && !retain_dentry(dentry))) { > + __dentry_kill(dentry); > + return parent; > + } > + /* we are keeping it, after all */ > + if (inode) > + spin_unlock(&inode->i_lock); > + if (parent) > + spin_unlock(&parent->d_lock); If someone else has grabbed a reference, it shouldn't be added to the lru list. Only decremented. if (entry->d_lockref.count == 1) > + dentry_lru_add(dentry); > + dentry->d_lockref.count--; > spin_unlock(&dentry->d_lock); > - return dentry; /* try again with same dentry */ > + return NULL; > } > > /*
On 2018-02-25 07:40:05 [+0000], Al Viro wrote: > FWIW, a variant of that series is in #work.dcache; it's > almost certainly not the final (I want to clean the things up > and probably reorder them as well) and it needs one hell of > profiling and review. It seems to work, and I don't see any > obvious performance regressions (everything seems to be within > normal noise), but I'd like it beaten up a lot more. If you have something in mind, I could fire up something. > I'll play with cleaning up and reordering tomorrow after I get > some sleep. In the meanwhile, the current state of that set is at > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache I've put it in -RT, let it do some upgrades and other things and nothing bad happened so far. Sebastian
On Tue, Feb 27, 2018 at 06:16:28AM +0100, John Ogness wrote: > If someone else has grabbed a reference, it shouldn't be added to the > lru list. Only decremented. > > if (entry->d_lockref.count == 1) Nah, better handle that in retain_dentry() itself. See updated #work.dcache. Note: another potentially fun thing in that branch is that I've finally decided to bite the bullet and make __d_move() preserve ->d_parent of target. Mainline: al@sonny:/tmp$ touch d al@sonny:/tmp$ sleep 100 >/tmp/a/b/c & [1] 16487 al@sonny:/tmp$ ls -l /proc/16487/fd total 0 lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13 l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/a/b/c lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13 al@sonny:/tmp$ mv /tmp/d /tmp/a/b/c al@sonny:/tmp$ ls -l /proc/16487/fd total 0 lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13 l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/c (deleted) lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13 With that branch: root@kvm1:/tmp# touch d root@kvm1:/tmp# sleep 100 >/tmp/a/b/c & [1] 2263 root@kvm1:/tmp# ls -l /proc/2263/fd total 0 lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0 l-wx------ 1 root root 64 Mar 12 11:33 1 -> /tmp/a/b/c lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0 root@kvm1:/tmp# mv /tmp/d /tmp/a/b/c root@kvm1:/tmp# ls -l /proc/2263/fd total 0 lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0 l-wx------ 1 root root 64 Mar 12 11:33 1 -> '/tmp/a/b/c (deleted)' lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0 It doesn't come quite for free; cross-directory d_move() and d_exchange() callers are responsible for having both parents pinned (all of them do that, mostly since the usual sequence is "look parents up, lock_rename(), *then* look children up, then do renaming"; those that are not part of rename(2) are also OK) and d_splice_alias() has become potentially blocking in one case. AFAICS, none of the callers is in locking environment that would not allow that. Survives the local beating and doesn't seem to cause any performance regressions. What we get out of that is a) much saner semantics for d_move() et.al. b) saner behaviour of d_path() (see above) c) dentry can be IS_ROOT only if it has been such all along; that simplifies the hell out of analysis. FWIW, there's another trylock loop on dentries - one in autofs get_next_positive_dentry(). Any plans re dealing with that one? I'd spent the last couple of weeks (when not being too sick for any work) going through dcache.c and related code; hopefully this time I will get the documentation into postable shape ;-/ There's an unpleasant area around the ->s_root vs. NFS. There's code that makes assumptions about ->s_root that are simply not true for NFS. Is path_connected() correct wrt NFS multiple imports from the same server? Ditto for mnt_already_visible() (that one might be mitigated at the moment, but probably won't last). Eric, am I missing something subtle in there?
On Mon, Mar 12, 2018 at 07:13:51PM +0000, Al Viro wrote: > There's an unpleasant area around the ->s_root vs. NFS. There's > code that makes assumptions about ->s_root that are simply not true > for NFS. Is path_connected() correct wrt NFS multiple imports from > the same server? Ditto for mnt_already_visible() (that one might > be mitigated at the moment, but probably won't last). Eric, am > I missing something subtle in there? BTW, another thing spotted while trying to document the stuff: Neil's "VFS: don't keep disconnected dentries on d_anon" has a subtle side effect I hadn't spotted when applying. Namely, for DCACHE_DISCONNECTED non-directory IS_ROOT dentries we now have d_unhashed() true. That makes d_find_alias() logics around discon_alias dead code: if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) { if (IS_ROOT(alias) && (alias->d_flags & DCACHE_DISCONNECTED)) { discon_alias = alias; } else { __dget_dlock(alias); spin_unlock(&alias->d_lock); return alias; } } Directory case is a red herring - we'll end up picking one and only alias, whether it's IS_ROOT/disconnected or not. For non-directories, though, IS_ROOT and disconnected implies d_unhashed() now, so we might as well turn that into if directory return d_find_any_alias else return the first hashed alias Does any of the d_find_alias() callers want those dentries? That is, non-directories from d_obtain_alias() still not attached to a parent; note that exportfs_decode_fh() is *not* one of such places - we don't use d_find_alias() there at all. If there's no such caller, we can bloody well just drop the discon_alias logics and be done with that; if there is, that commit has introduced a bug. I might have missed a part of threads related to that patch, so my apologies if it had been discussed. Neil, what's the situation there? A lot of those callers clearly treat the "only disconnected IS_ROOT alias exist" same as "no aliases at all"; it looks like the change might have been the right thing, but it sure as hell shouldn't be an undocumented side effect...
Al Viro <viro@ZenIV.linux.org.uk> writes: > On Tue, Feb 27, 2018 at 06:16:28AM +0100, John Ogness wrote: > >> If someone else has grabbed a reference, it shouldn't be added to the >> lru list. Only decremented. >> >> if (entry->d_lockref.count == 1) > > Nah, better handle that in retain_dentry() itself. See updated > #work.dcache. > > Note: another potentially fun thing in that branch is that I've > finally decided to bite the bullet and make __d_move() preserve > ->d_parent of target. > > Mainline: > al@sonny:/tmp$ touch d > al@sonny:/tmp$ sleep 100 >/tmp/a/b/c & > [1] 16487 > al@sonny:/tmp$ ls -l /proc/16487/fd > total 0 > lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13 > l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/a/b/c > lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13 > al@sonny:/tmp$ mv /tmp/d /tmp/a/b/c > al@sonny:/tmp$ ls -l /proc/16487/fd > total 0 > lrwx------ 1 al al 64 Mar 12 11:33 0 -> /dev/pts/13 > l-wx------ 1 al al 64 Mar 12 11:33 1 -> /tmp/c (deleted) > lrwx------ 1 al al 64 Mar 12 11:33 2 -> /dev/pts/13 > > With that branch: > root@kvm1:/tmp# touch d > root@kvm1:/tmp# sleep 100 >/tmp/a/b/c & > [1] 2263 > root@kvm1:/tmp# ls -l /proc/2263/fd > total 0 > lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0 > l-wx------ 1 root root 64 Mar 12 11:33 1 -> /tmp/a/b/c > lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0 > root@kvm1:/tmp# mv /tmp/d /tmp/a/b/c > root@kvm1:/tmp# ls -l /proc/2263/fd > total 0 > lrwx------ 1 root root 64 Mar 12 11:33 0 -> /dev/pts/0 > l-wx------ 1 root root 64 Mar 12 11:33 1 -> '/tmp/a/b/c (deleted)' > lrwx------ 1 root root 64 Mar 12 11:33 2 -> /dev/pts/0 > > It doesn't come quite for free; cross-directory d_move() > and d_exchange() callers are responsible for having both > parents pinned (all of them do that, mostly since the usual > sequence is "look parents up, lock_rename(), *then* look > children up, then do renaming"; those that are not part of > rename(2) are also OK) and d_splice_alias() has become potentially > blocking in one case. AFAICS, none of the callers is in > locking environment that would not allow that. Survives > the local beating and doesn't seem to cause any performance > regressions. > > What we get out of that is > a) much saner semantics for d_move() et.al. > b) saner behaviour of d_path() (see above) > c) dentry can be IS_ROOT only if it has been > such all along; that simplifies the hell out of analysis. > > FWIW, there's another trylock loop on dentries - one in > autofs get_next_positive_dentry(). Any plans re dealing > with that one? > > I'd spent the last couple of weeks (when not being too sick > for any work) going through dcache.c and related code; hopefully > this time I will get the documentation into postable shape ;-/ > > There's an unpleasant area around the ->s_root vs. NFS. There's > code that makes assumptions about ->s_root that are simply not true > for NFS. Is path_connected() correct wrt NFS multiple imports from > the same server? Ditto for mnt_already_visible() (that one might > be mitigated at the moment, but probably won't last). Eric, am > I missing something subtle in there? I don't have the entire context in my head. But I don't think we have problems today. NFS before it uses paths from an unconnected root in the rest of the vfs walks those paths backwards and makes the paths connect. I don't remember where all of that code that performs those connections but I do remember the code in fs/fhandle.c shares that code with nfs, to perform the same operation in open_by_handle_at. So I don't think the nfs peculiarities are actually relevant to anything on an ordinary code path. Of the two code paths you are concert about: For path path_connected looking at s_root is a heuristic to avoid calling is_subdir every time we need to do that check. If the heuristic fails we still have is_subdir which should remain accurate. If is_subdir fails the path is genuinely not connected at that moment and failing is the correct thing to do. For mnt_too_revealing the only filesystems under consideration are proc and sysfs. So nfs oddities are of no consequence. mnt_too_revealing probably won't be extended to other filesystems. Certainly nfs is not a candidate for having setting SB_I_USERNS_VISIBLE. Al is that sufficient to address your concerns? Eric
On Mon, Mar 12, 2018 at 08:05:41PM +0000, Al Viro wrote: > Does any of the d_find_alias() callers want those dentries? That is, > non-directories from d_obtain_alias() still not attached to a parent; > note that exportfs_decode_fh() is *not* one of such places - we don't > use d_find_alias() there at all. If there's no such caller, we can > bloody well just drop the discon_alias logics and be done with that; > if there is, that commit has introduced a bug. I might have missed > a part of threads related to that patch, so my apologies if it had > been discussed. > > Neil, what's the situation there? A lot of those callers clearly treat > the "only disconnected IS_ROOT alias exist" same as "no aliases at all"; > it looks like the change might have been the right thing, but it sure > as hell shouldn't be an undocumented side effect... FWIW, callers are drivers/staging/lustre/lustre/llite/llite_lib.c:2448: dentry = d_find_alias(page->mapping->host); definitely doesn't want them fs/ceph/caps.c:2945: while ((dn = d_find_alias(inode))) { fs/ceph/mds_client.c:1930: dentry = d_find_alias(inode); ditto - it's building a pathname. fs/ceph/mds_client.c:2910: dentry = d_find_alias(inode); ditto. fs/ceph/mds_client.c:3374: parent = d_find_alias(inode); clearly at least expects a directory (feeds to d_lookup(), etc.) fs/coda/cache.c:112: alias_de = d_find_alias(inode); directory-only fs/fat/namei_vfat.c:736: alias = d_find_alias(inode); we will reject any IS_ROOT there anyway fs/fs-writeback.c:2071: dentry = d_find_alias(inode); wants a name, these don't have one. fs/fuse/dir.c:966: dir = d_find_alias(parent); directory-only fs/hostfs/hostfs_kern.c:129: dentry = d_find_alias(ino); wants a pathname fs/nilfs2/super.c:931: dentry = d_find_alias(inode); directory-only fs/nilfs2/super.c:1025: dentry = d_find_alias(inode); bloody better be a directory (root one, at that) fs/xfs/xfs_filestream.c:293: dentry = d_find_alias(inode); no parent, no love. mm/shmem.c:3435: dentry = d_find_alias(inode); no d_obtain_alias() on that one security/commoncap.c:391: dentry = d_find_alias(inode); security/lsm_audit.c:295: dentry = d_find_alias(inode); wants a name security/selinux/hooks.c:1536: dentry = d_find_alias(inode); security/selinux/hooks.c:1646: dentry = d_find_alias(inode); So all but four of them clearly do *not* want those suckers. And remaining ones are in * ceph invalidate_aliases() * cap_inode_getsecurity() * selinux inode_doinit_with_dentry() (two call sites, very alike) Do any of those want that stuff?
On Mon, Mar 12, 2018 at 03:23:44PM -0500, Eric W. Biederman wrote: > Of the two code paths you are concert about: > > For path path_connected looking at s_root is a heuristic to avoid > calling is_subdir every time we need to do that check. If the heuristic > fails we still have is_subdir which should remain accurate. If > is_subdir fails the path is genuinely not connected at that moment > and failing is the correct thing to do. Umm... That might be not good enough - the logics is "everything's reachable from ->s_root anyway, so we might as well not bother checking". For NFS it's simply not true. We can mount server:/foo/bar/baz on /tmp/a, then server:/foo on /tmp/b and we'll have ->s_root pointing to a subtree of what's reachable at /tmp/b. Play with renames under /tmp/b and you just might end up with a problem. And mount on /tmp/a will be (mistakenly) considered to be safe, since it satisfies the heuristics in path_connected().
On Mon, 12 Mar 2018, Al Viro wrote: > On Tue, Feb 27, 2018 at 06:16:28AM +0100, John Ogness wrote: > What we get out of that is > a) much saner semantics for d_move() et.al. > b) saner behaviour of d_path() (see above) > c) dentry can be IS_ROOT only if it has been > such all along; that simplifies the hell out of analysis. Nice. > FWIW, there's another trylock loop on dentries - one in > autofs get_next_positive_dentry(). Any plans re dealing > with that one? It's on our list and we wanted to wait for the final resolution of dcache before tackling it. Do you have any immediate ide about that? > I'd spent the last couple of weeks (when not being too sick > for any work) going through dcache.c and related code; hopefully > this time I will get the documentation into postable shape ;-/ Documentation is surely welcome. Paging that code back in is a mindboggling exercise. Thanks, tglx
Al Viro <viro@ZenIV.linux.org.uk> writes: > On Mon, Mar 12, 2018 at 03:23:44PM -0500, Eric W. Biederman wrote: > >> Of the two code paths you are concert about: >> >> For path path_connected looking at s_root is a heuristic to avoid >> calling is_subdir every time we need to do that check. If the heuristic >> fails we still have is_subdir which should remain accurate. If >> is_subdir fails the path is genuinely not connected at that moment >> and failing is the correct thing to do. > > Umm... That might be not good enough - the logics is "everything's > reachable from ->s_root anyway, so we might as well not bother checking". > For NFS it's simply not true. If I am parsing the code correctly path_connected is broken for nfsv2 and nfsv3 when NFS_MOUNT_UNSHARED is not set. nfsv4 appears to make a kernel mount of the real root of the filesystem properly setting s_root and then finds the child it is mounting. > We can mount server:/foo/bar/baz on /tmp/a, then server:/foo on /tmp/b > and we'll have ->s_root pointing to a subtree of what's reachable at > /tmp/b. Play with renames under /tmp/b and you just might end up with > a problem. And mount on /tmp/a will be (mistakenly) considered to > be safe, since it satisfies the heuristics in path_connected(). Agreed. Which means that if you mount server:/foo/bar/baz first and then mount server:/foo with an appropriate rename you might be able to see all of server:/foo or possibly server:/ Hmm.. Given that nfs_kill_super calls generic_shutdown_super and generic_shutdown_super calls shrink_dcache_for_umount I would argue that nfsv2 and nfsv3 are buggy in the same case, as shrink_dcache_for_umount is called on something that is not the root of the filesystem's dentry tree. Eric
On Mon, Mar 12, 2018 at 06:28:30PM -0500, Eric W. Biederman wrote: > Given that nfs_kill_super calls generic_shutdown_super and > generic_shutdown_super calls shrink_dcache_for_umount > I would argue that nfsv2 and nfsv3 are buggy in the same > case, as shrink_dcache_for_umount is called on something > that is not the root of the filesystem's dentry tree. That's not a problem - it will simply evict a subtree first, then go through ->s_roots and kill each piece.
On Mon, Mar 12 2018, Al Viro wrote: > On Mon, Mar 12, 2018 at 07:13:51PM +0000, Al Viro wrote: > >> There's an unpleasant area around the ->s_root vs. NFS. There's >> code that makes assumptions about ->s_root that are simply not true >> for NFS. Is path_connected() correct wrt NFS multiple imports from >> the same server? Ditto for mnt_already_visible() (that one might >> be mitigated at the moment, but probably won't last). Eric, am >> I missing something subtle in there? > > BTW, another thing spotted while trying to document the stuff: > Neil's "VFS: don't keep disconnected dentries on d_anon" has a subtle > side effect I hadn't spotted when applying. Namely, for > DCACHE_DISCONNECTED non-directory IS_ROOT dentries we now have > d_unhashed() true. That makes d_find_alias() logics around > discon_alias dead code: > if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) { > if (IS_ROOT(alias) && > (alias->d_flags & DCACHE_DISCONNECTED)) { > discon_alias = alias; > } else { > __dget_dlock(alias); > spin_unlock(&alias->d_lock); > return alias; > } > } I agree that this code is now more complex than needed. > > Directory case is a red herring - we'll end up picking one and only > alias, whether it's IS_ROOT/disconnected or not. For non-directories, > though, IS_ROOT and disconnected implies d_unhashed() now, so we might > as well turn that into > if directory > return d_find_any_alias > else > return the first hashed alias Yes, this looks like a good simplification. > > Does any of the d_find_alias() callers want those dentries? That is, > non-directories from d_obtain_alias() still not attached to a parent; > note that exportfs_decode_fh() is *not* one of such places - we don't > use d_find_alias() there at all. If there's no such caller, we can > bloody well just drop the discon_alias logics and be done with that; > if there is, that commit has introduced a bug. I might have missed > a part of threads related to that patch, so my apologies if it had > been discussed. > > Neil, what's the situation there? A lot of those callers clearly treat > the "only disconnected IS_ROOT alias exist" same as "no aliases at all"; > it looks like the change might have been the right thing, but it sure > as hell shouldn't be an undocumented side effect... Many of the callers really want a name, which these disconnected dentries didn't have, so the new code is more correct in those cases. d_find_alias() is documented as returning a hashed dentry (for non-directories) and I suspect most people would be surprised to find that a disconnected dentry with no name was considered to be "hashed", so I think the patch could be seen as fixing the documentation. Of the three users that you didn't classify as "clearly" not wanting the disconnected dentries: > * ceph invalidate_aliases() This wants to ensure the dentries are discarded on final dput(). This is already the case for disconnected dentries, so it doesn't need to be told about them. Code is correct. > * cap_inode_getsecurity() If this gets invoked when the nfsd server calls vfs_getxattr() on a disconnected dentry, it will return -EINVAL. This looks like a potential bug. nfsd only calls vfs_getxattr() in nfsd4_is_junction() so this bug would not affect many use cases. I think cap_inode_getsecurity() should really be calling d_find_any_alias(). > * selinux inode_doinit_with_dentry() (two call sites, very alike) I'm less sure about this one, but I think it probably wants d_find_any_alias() as well. Like cap_inode_getsecurity() it only wants a dentry so that it can pass something to __vfs_getxattr(), and that only wants a dentry so it can pass something to ->get. Possibly we should rename d_find_alias() to d_find_hashed_alias() so that people need to make a conscious choice between d_find_hashed_alias() and d_find_any_alias() ?? Thanks, NeilBrown
On 2018-03-12, Al Viro <viro@ZenIV.linux.org.uk> wrote: >> If someone else has grabbed a reference, it shouldn't be added to the >> lru list. Only decremented. >> >> if (entry->d_lockref.count == 1) > > Nah, better handle that in retain_dentry() itself. See updated > #work.dcache. > > + if (unlikely(dentry->d_lockref.count != 1)) { > + dentry->d_lockref.count--; > + } else if (likely(!retain_dentry(dentry))) { > + __dentry_kill(dentry); > + return parent; > + } Although the updated version is correct (and saves on lines of code), I find putting the deref and lru_add code in the "true" case of retain_dentry() to be pretty tricky. IMHO the code is easier to understand if it looks like this: if (unlikely(dentry->d_lockref.count != 1)) { dentry->d_lockref.count--; } else if (likely(!retain_dentry(dentry))) { __dentry_kill(dentry); return parent; } else { dentry->d_lockref.count--; dentry_lru_add(dentry); } This is what your version is doing, but that final else is hiding in the retain_dentry() "true" case. My suggestion is to revert 7479f57fecd2a4837b5c79ce1cf0dcf284db54be (and then fixup dput() to deref before calling dentry_lru_add()). > FWIW, there's another trylock loop on dentries - one in > autofs get_next_positive_dentry(). Any plans re dealing > with that one? I will need to dig into it a bit deeper (I am unfamiliar with autofs), but it looks like it is trying to do basically the same thing as the ascend loop in d_walk(). > I'd spent the last couple of weeks (when not being too sick > for any work) going through dcache.c and related code; hopefully > this time I will get the documentation into postable shape ;-/ Thank you for all your help in getting these changes cleaned and correctly implemented so quickly. I've reviewed your latest trylock loop removal patches and found only 1 minor issue. I'll post that separately. John Ogness
Hi Al, 1 minor issue on the new shrink_lock_dentry()... > From 121a8e0834862d2c5d88c95f8e6bc8984f195abf Mon Sep 17 00:00:00 2001 > From: Al Viro <viro@zeniv.linux.org.uk> > Date: Fri, 23 Feb 2018 21:54:18 -0500 > Subject: [PATCH] get rid of trylock loop in locking dentries on shrink > list > > In case of trylock failure don't re-add to the list - drop the locks > and carefully get them in the right order. For shrink_dentry_list(), > somebody having grabbed a reference to dentry means that we can > kick it off-list, so if we find dentry being modified under us we > don't need to play silly buggers with retries anyway - off the list > it is. > > The locking logics taken out into a helper of its own; lock_parent() > is no longer used for dentries that can be killed under us. > > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> > --- > fs/dcache.c | 103 ++++++++++++++++++++++++++++++++++++++---------------------- > 1 file changed, 66 insertions(+), 37 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index 1684b6b..58097fd 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -974,56 +974,85 @@ void d_prune_aliases(struct inode *inode) > } > EXPORT_SYMBOL(d_prune_aliases); > > -static void shrink_dentry_list(struct list_head *list) > +/* > + * Lock a dentry from shrink list. > + * Note that dentry is *not* protected from concurrent dentry_kill(), > + * d_delete(), etc. It is protected from freeing (by the fact of > + * being on a shrink list), but everything else is fair game. > + * Return false if dentry has been disrupted or grabbed, leaving > + * the caller to kick it off-list. Otherwise, return true and have > + * that dentry's inode and parent both locked. > + */ > +static bool shrink_lock_dentry(struct dentry *dentry) > { > - struct dentry *dentry, *parent; > + struct inode *inode; > + struct dentry *parent; > > - while (!list_empty(list)) { > - struct inode *inode; > - dentry = list_entry(list->prev, struct dentry, d_lru); > + if (dentry->d_lockref.count) > + return false; > + > + inode = dentry->d_inode; > + if (inode && unlikely(!spin_trylock(&inode->i_lock))) { > + rcu_read_lock(); /* to protect inode */ > + spin_unlock(&dentry->d_lock); > + spin_lock(&inode->i_lock); > spin_lock(&dentry->d_lock); > - parent = lock_parent(dentry); > + if (unlikely(dentry->d_lockref.count)) > + goto out; > + /* changed inode means that somebody had grabbed it */ > + if (unlikely(inode != dentry->d_inode)) > + goto out; > + rcu_read_unlock(); > + } > > - /* > - * The dispose list is isolated and dentries are not accounted > - * to the LRU here, so we can simply remove it from the list > - * here regardless of whether it is referenced or not. > - */ > - d_shrink_del(dentry); > + parent = dentry->d_parent; > + if (IS_ROOT(dentry) || likely(spin_trylock(&parent->d_lock))) > + return true; > > - /* > - * We found an inuse dentry which was not removed from > - * the LRU because of laziness during lookup. Do not free it. > - */ > - if (dentry->d_lockref.count > 0) { > - spin_unlock(&dentry->d_lock); > - if (parent) > - spin_unlock(&parent->d_lock); > - continue; > - } > + rcu_read_lock(); /* to protect parent */ > + spin_unlock(&dentry->d_lock); > + parent = READ_ONCE(dentry->d_parent); The preceeding line should be removed. We already have a "parent" from before we did the most recent trylock(). > + spin_lock(&parent->d_lock); > + if (unlikely(parent != dentry->d_parent)) { > + spin_unlock(&parent->d_lock); > + spin_lock(&dentry->d_lock); > + goto out; > + } > + spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED); > + if (likely(!dentry->d_lockref.count)) { > + rcu_read_unlock(); > + return true; > + } > + spin_unlock(&parent->d_lock); > +out: > + spin_unlock(&inode->i_lock); > + rcu_read_unlock(); > + return false; > +} > > +static void shrink_dentry_list(struct list_head *list) > +{ > + while (!list_empty(list)) { > + struct dentry *dentry, *parent; > + struct inode *inode; > > - if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) { > - bool can_free = dentry->d_flags & DCACHE_MAY_FREE; > + dentry = list_entry(list->prev, struct dentry, d_lru); > + spin_lock(&dentry->d_lock); > + if (!shrink_lock_dentry(dentry)) { > + bool can_free = false; > + d_shrink_del(dentry); > + if (dentry->d_lockref.count < 0) > + can_free = dentry->d_flags & DCACHE_MAY_FREE; > spin_unlock(&dentry->d_lock); > - if (parent) > - spin_unlock(&parent->d_lock); > if (can_free) > dentry_free(dentry); > continue; > } > - > - inode = dentry->d_inode; > - if (inode && unlikely(!spin_trylock(&inode->i_lock))) { > - d_shrink_add(dentry, list); > - spin_unlock(&dentry->d_lock); > - if (parent) > - spin_unlock(&parent->d_lock); > - continue; > - } > - > + d_shrink_del(dentry); > + parent = dentry->d_parent; > __dentry_kill(dentry); > - > + if (parent == dentry) > + continue; > /* > * We need to prune ancestors too. This is necessary to prevent > * quadratic behavior of shrink_dcache_parent(), but is also John Ogness
On Tue, Mar 13, 2018 at 12:12:51PM +1100, NeilBrown wrote: > > * selinux inode_doinit_with_dentry() (two call sites, very alike) > > I'm less sure about this one, but I think it probably wants > d_find_any_alias() as well. Like cap_inode_getsecurity() it only wants > a dentry so that it can pass something to __vfs_getxattr(), > and that only wants a dentry so it can pass something to ->get. > > Possibly we should rename d_find_alias() to d_find_hashed_alias() so that > people need to make a conscious choice between d_find_hashed_alias() and > d_find_any_alias() ?? FWIW, it *is* a bug; this /* * this is can be hit on boot when a file is accessed * before the policy is loaded. When we load policy we * may find inodes that have no dentry on the * sbsec->isec_head list. No reason to complain as these * will get fixed up the next time we go through * inode_doinit with a dentry, before these inodes could * be used again by userspace. */ in selinux/hooks.c is flat-out wrong now. Sure, if you first load selinux policy after exporting something over NFS or letting attacker play with open-by-fhandle, you are past any help, but still... I disagree about going for d_find_any_alias() from the very beginning, BTW - we need to try it in case of d_find_alias() failure, but sufficiently crappy filesystem can bloody well fail to access xattrs via disconnected dentry.
diff --git a/fs/dcache.c b/fs/dcache.c index 7c38f39958bc..32aaab21e648 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -647,11 +647,16 @@ static inline struct dentry *lock_parent(struct dentry *dentry) spin_unlock(&parent->d_lock); goto again; } - rcu_read_unlock(); - if (parent != dentry) + if (parent != dentry) { spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED); - else + if (unlikely(dentry->d_lockref.count < 0)) { + spin_unlock(&parent->d_lock); + parent = NULL; + } + } else { parent = NULL; + } + rcu_read_unlock(); return parent; }