diff mbox

[RFC] rmdir(),rename(): do shrink_dcache_parent() only on success

Message ID 20180527222029.GR30522@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Al Viro May 27, 2018, 10:20 p.m. UTC
Once upon a time ->rmdir() instances used to check if victim inode
had more than one (in-core) reference and failed with -EBUSY if it
had.  The reason was race avoidance - emptiness check is worthless
if somebody could just go and create new objects in the victim
directory afterwards.

With introduction of dcache the checks had been replaced with
checking the refcount of dentry.  However, since a cached negative
lookup leaves a negative child dentry, such check had lead to false
positives - with empty foo/ doing stat foo/bar before rmdir foo
ended up with -EBUSY unless the negative dentry of foo/bar happened
to be evicted by the time of rmdir(2).  That had been fixed by
doing shrink_dcache_parent() just before the refcount check.

At the same time, ext2_rmdir() has grown a private solution that
eliminated those -EBUSY - it did something (setting ->i_size to 0)
which made any subsequent ext2_add_entry() fail.

Unfortunately, even with shrink_dcache_parent() the check had been
racy - after all, the victim itself could be found by dcache lookup
just after we'd checked its refcount.  That got fixed by a new
helper (dentry_unhash()) that did shrink_dcache_parent() and unhashed
the sucker if its refcount ended up equal to 1.  That got called before
->rmdir(), turning the checks in ->rmdir() instances into "if not
unhashed fail with -EBUSY".  Which reduced the boilerplate nicely, but
had an unpleasant side effect - now shrink_dcache_parent() had been
done before the emptiness checks, leading to easily triggerable calls
of shrink_dcache_parent() on arbitrary large subtrees, quite possibly
nested into each other.

Several years later the ext2-private trick had been generalized -
(in-core) inodes of dead directories are flagged and calls of
lookup, readdir and all directory-modifying methods were prevented
in so marked directories.  Remaining boilerplate in ->rmdir() instances
became redundant and some instances got rid of it.

In 2011 the call of dentry_unhash() got shifted into ->rmdir() instances
and then killed off in all of them.  That has lead to another problem,
though - in case of successful rmdir we *want* any (negative) child
dentries dropped and the victim itself made negative.  There's no point
keeping cached negative lookups in foo when we can get the negative
lookup of foo itself cached.  So shrink_dcache_parent() call had been
restored; unfortunately, it went into the place where dentry_unhash()
used to be, i.e. before the ->rmdir() call.  Note that we don't unhash
anymore, so any "is it busy" checks would be racy; fortunately, all of
them are gone.

We should've done that call right *after* successful ->rmdir().  That
reduces contention caused by tree-walking in shrink_dcache_parent()
and, especially, contention caused by evictions in two nested subtrees
going on in parallel.  The same goes for directory-overwriting rename() -
the story there had been parallel to that of rmdir().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

Comments

Linus Torvalds May 27, 2018, 11:04 p.m. UTC | #1
On Sun, May 27, 2018 at 3:20 PM Al Viro <viro@zeniv.linux.org.uk> wrote:

> We should've done that call right *after* successful ->rmdir().  That
> reduces contention caused by tree-walking in shrink_dcache_parent()
> and, especially, contention caused by evictions in two nested subtrees
> going on in parallel.  The same goes for directory-overwriting rename() -
> the story there had been parallel to that of rmdir().

Looks eminently sane to me.

               Linus
Andreas Dilger May 28, 2018, 2:14 a.m. UTC | #2
On May 27, 2018, at 4:20 PM, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> Once upon a time ->rmdir() instances used to check if victim inode
> had more than one (in-core) reference and failed with -EBUSY if it
> had.  The reason was race avoidance - emptiness check is worthless
> if somebody could just go and create new objects in the victim
> directory afterwards.
> 
> With introduction of dcache the checks had been replaced with
> checking the refcount of dentry.  However, since a cached negative
> lookup leaves a negative child dentry, such check had lead to false
> positives - with empty foo/ doing stat foo/bar before rmdir foo
> ended up with -EBUSY unless the negative dentry of foo/bar happened
> to be evicted by the time of rmdir(2).  That had been fixed by
> doing shrink_dcache_parent() just before the refcount check.
> 
> At the same time, ext2_rmdir() has grown a private solution that
> eliminated those -EBUSY - it did something (setting ->i_size to 0)
> which made any subsequent ext2_add_entry() fail.
> 
> Unfortunately, even with shrink_dcache_parent() the check had been
> racy - after all, the victim itself could be found by dcache lookup
> just after we'd checked its refcount.  That got fixed by a new
> helper (dentry_unhash()) that did shrink_dcache_parent() and unhashed
> the sucker if its refcount ended up equal to 1.  That got called before
> ->rmdir(), turning the checks in ->rmdir() instances into "if not
> unhashed fail with -EBUSY".  Which reduced the boilerplate nicely, but
> had an unpleasant side effect - now shrink_dcache_parent() had been
> done before the emptiness checks, leading to easily triggerable calls
> of shrink_dcache_parent() on arbitrary large subtrees, quite possibly
> nested into each other.
> 
> Several years later the ext2-private trick had been generalized -
> (in-core) inodes of dead directories are flagged and calls of
> lookup, readdir and all directory-modifying methods were prevented
> in so marked directories.  Remaining boilerplate in ->rmdir() instances
> became redundant and some instances got rid of it.
> 
> In 2011 the call of dentry_unhash() got shifted into ->rmdir() instances
> and then killed off in all of them.  That has lead to another problem,
> though - in case of successful rmdir we *want* any (negative) child
> dentries dropped and the victim itself made negative.  There's no point
> keeping cached negative lookups in foo when we can get the negative
> lookup of foo itself cached.  So shrink_dcache_parent() call had been
> restored; unfortunately, it went into the place where dentry_unhash()
> used to be, i.e. before the ->rmdir() call.  Note that we don't unhash
> anymore, so any "is it busy" checks would be racy; fortunately, all of
> them are gone.
> 
> We should've done that call right *after* successful ->rmdir().  That
> reduces contention caused by tree-walking in shrink_dcache_parent()
> and, especially, contention caused by evictions in two nested subtrees
> going on in parallel.  The same goes for directory-overwriting rename() -
> the story there had been parallel to that of rmdir().
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/namei.c b/fs/namei.c
> index 186bd2464fd5..269b64a1121a 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3847,11 +3847,11 @@ int vfs_rmdir(struct inode *dir, struct dentry *dentry)
> 	if (error)
> 		goto out;
> 
> -	shrink_dcache_parent(dentry);
> 	error = dir->i_op->rmdir(dir, dentry);
> 	if (error)
> 		goto out;
> 
> +	shrink_dcache_parent(dentry);
> 	dentry->d_inode->i_flags |= S_DEAD;
> 	dont_mount(dentry);
> 	detach_mounts(dentry);
> @@ -4434,8 +4434,6 @@ int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
> 		    old_dir->i_nlink >= max_links)
> 			goto out;
> 	}
> -	if (is_dir && !(flags & RENAME_EXCHANGE) && target)
> -		shrink_dcache_parent(new_dentry);
> 	if (!is_dir) {
> 		error = try_break_deleg(source, delegated_inode);
> 		if (error)
> @@ -4452,8 +4450,10 @@ int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
> 		goto out;
> 
> 	if (!(flags & RENAME_EXCHANGE) && target) {
> -		if (is_dir)
> +		if (is_dir) {
> +			shrink_dcache_parent(new_dentry);
> 			target->i_flags |= S_DEAD;

Would it be better to set S_DEAD on the removed directory before
shrink_dcache_parent() is called (here and in vfs_rmdir()), or is
there no way for a new dentry to be added to the parent after the
shrink is done?

Cheers, Andreas
Al Viro May 28, 2018, 2:48 a.m. UTC | #3
On Sun, May 27, 2018 at 08:14:30PM -0600, Andreas Dilger wrote:

> > 	if (!(flags & RENAME_EXCHANGE) && target) {
> > -		if (is_dir)
> > +		if (is_dir) {
> > +			shrink_dcache_parent(new_dentry);
> > 			target->i_flags |= S_DEAD;
> 
> Would it be better to set S_DEAD on the removed directory before
> shrink_dcache_parent() is called (here and in vfs_rmdir()), or is
> there no way for a new dentry to be added to the parent after the
> shrink is done?

It's locked (exclusive).  Lookups (as well as readdir preseeding
of dcache in case of filesystems that do it) are under the same
lock held at least shared.  The same goes for all IS_DEADDIR callers -
exact same locking is used for S_DEAD handling as well.  So the order
really doesn't matter here.
diff mbox

Patch

diff --git a/fs/namei.c b/fs/namei.c
index 186bd2464fd5..269b64a1121a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3847,11 +3847,11 @@  int vfs_rmdir(struct inode *dir, struct dentry *dentry)
 	if (error)
 		goto out;
 
-	shrink_dcache_parent(dentry);
 	error = dir->i_op->rmdir(dir, dentry);
 	if (error)
 		goto out;
 
+	shrink_dcache_parent(dentry);
 	dentry->d_inode->i_flags |= S_DEAD;
 	dont_mount(dentry);
 	detach_mounts(dentry);
@@ -4434,8 +4434,6 @@  int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 		    old_dir->i_nlink >= max_links)
 			goto out;
 	}
-	if (is_dir && !(flags & RENAME_EXCHANGE) && target)
-		shrink_dcache_parent(new_dentry);
 	if (!is_dir) {
 		error = try_break_deleg(source, delegated_inode);
 		if (error)
@@ -4452,8 +4450,10 @@  int vfs_rename(struct inode *old_dir, struct dentry *old_dentry,
 		goto out;
 
 	if (!(flags & RENAME_EXCHANGE) && target) {
-		if (is_dir)
+		if (is_dir) {
+			shrink_dcache_parent(new_dentry);
 			target->i_flags |= S_DEAD;
+		}
 		dont_mount(new_dentry);
 		detach_mounts(new_dentry);
 	}