diff mbox series

shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects)

Message ID 20190629043803.GT17978@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show
Series shrink_dentry_list() logics change (was Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects) | expand

Commit Message

Al Viro June 29, 2019, 4:38 a.m. UTC
On Sat, Jun 29, 2019 at 05:08:44AM +0100, Al Viro wrote:
> > The reason we don't hit that problem with regular memory shrinker is
> > this:
> >                 unregister_shrinker(&s->s_shrink);
> >                 fs->kill_sb(s);
> > in deactivate_locked_super().  IOW, shrinker for this fs is gone
> > before we get around to shutdown.  And so are all normal sources
> > of dentry eviction for that fs.
> > 
> > Your earlier variants all suffer the same problem - picking a page
> > shared by dentries from several superblocks can run into trouble
> > if it overlaps with umount of one of those.

PS: the problem is not gone in the next iteration of the patchset in
question.  The patch I'm proposing (including dput_to_list() and _ONLY_
compile-tested) follows.  Comments?

Comments

Al Viro June 29, 2019, 7:06 p.m. UTC | #1
On Sat, Jun 29, 2019 at 05:38:03AM +0100, Al Viro wrote:

> PS: the problem is not gone in the next iteration of the patchset in
> question.  The patch I'm proposing (including dput_to_list() and _ONLY_
> compile-tested) follows.  Comments?

FWIW, there's another unpleasantness in the whole thing.  Suppose we have
picked a page full of dentries, all with refcount 0.  We decide to
evict all of them.  As it turns out, they are from two filesystems.
Filesystem 1 is NFS on a server, with currently downed hub on the way
to it.  Filesystem 2 is local.  We attempt to evict an NFS dentry and
get stuck - tons of dirty data with no way to flush them on server.
In the meanwhile, admin tries to unmount the local filesystem.  And
gets stuck as well, since umount can't do anything to its dentries
that happen to sit in our shrink list.

I wonder if the root of problem here isn't in shrink_dcache_for_umount();
all it really needs is to have everything on that fs with refcount 0
dragged through __dentry_kill().  If something had been on a shrink
list, __dentry_kill() will just leave behind a struct dentry completely
devoid of any connection to superblock, other dentries, filesystem
type, etc. - it's just a piece of memory that won't be freed until
the owner of shrink list finally gets around to it.  Which can happen
at any point - all they'll do to it is dentry_free(), and that doesn't
need any fs-related data structures.

The logics in shrink_dcache_parent() is
	collect everything evictable into a shrink list
	if anything found - kick it out and repeat the scan
	otherwise, if something had been on other's shrink list
		repeat the scan

I wonder if after the "no evictable candidates, but something
on other's shrink lists" we ought to do something along the
lines of
	rcu_read_lock
	walk it, doing
		if dentry has zero refcount
			if it's not on a shrink list,
				move it to ours
			else
				store its address in 'victim'
				end the walk
	if no victim found
		rcu_read_unlock
	else
		lock victim for __dentry_kill
		rcu_read_unlock
		if it's still alive
			if it's not IS_ROOT
				if parent is not on shrink list
					decrement parent's refcount
					put it on our list
				else
					decrement parent's refcount
			__dentry_kill(victim)
		else
			unlock
	if our list is non-empty
		shrink_dentry_list on it
in there...
Al Viro June 29, 2019, 10:29 p.m. UTC | #2
On Sat, Jun 29, 2019 at 08:06:24PM +0100, Al Viro wrote:
> I wonder if after the "no evictable candidates, but something
> on other's shrink lists" we ought to do something along the
> lines of
> 	rcu_read_lock
> 	walk it, doing
> 		if dentry has zero refcount
> 			if it's not on a shrink list,
> 				move it to ours
> 			else
> 				store its address in 'victim'
> 				end the walk
> 	if no victim found
> 		rcu_read_unlock
> 	else
> 		lock victim for __dentry_kill
> 		rcu_read_unlock
> 		if it's still alive
> 			if it's not IS_ROOT
> 				if parent is not on shrink list
> 					decrement parent's refcount
> 					put it on our list
> 				else
> 					decrement parent's refcount
> 			__dentry_kill(victim)
> 		else
> 			unlock
> 	if our list is non-empty
> 		shrink_dentry_list on it
> in there...

Like this (again, only build-tested):

Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists

Currently, running into a shrink list that contains dentries from different
filesystems can cause several unpleasant things for shrink_dcache_parent()
and for umount(2).

The first problem is that there's a window during shrink_dentry_list() between
__dentry_kill() takes a victim out and dropping reference to its parent.  During
that window the parent looks like a genuine busy dentry.  shrink_dcache_parent()
(or, worse yet, shrink_dcache_for_umount()) coming at that time will see no
eviction candidates and no indication that it needs to wait for some
shrink_dentry_list() to proceed further.

That applies for any shrink list that might intersect with the subtree we are
trying to shrink; the only reason it does not blow on umount(2) in the mainline
is that we unregister the memory shrinker before hitting shrink_dcache_for_umount().

Another problem happens if something in a mixed-filesystem shrink list gets
be stuck in e.g. iput(), getting umount of unrelated fs to spin waiting for
the stuck shrinker to get around to our dentries.

Solution:
	1) have shrink_dentry_list() decrement the parent's refcount and
make sure it's on a shrink list (ours unless it already had been on some
other) before calling __dentry_kill().  That eliminates the window when
shrink_dcache_parent() would've blown past the entire subtree without
noticing anything with zero refcount not on shrink lists.
	2) when shrink_dcache_parent() has found no eviction candidates,
but some dentries are still sitting on shrink lists, rather than
repeating the scan in hope that shrinkers have progressed, scan looking
for something on shrink lists with zero refcount.  If such a thing is
found, grab rcu_read_lock() and stop the scan, with caller locking
it for eviction, dropping out of RCU and doing __dentry_kill(), with
the same treatment for parent as shrink_dentry_list() would do.

Note that right now mixed-filesystem shrink lists do not occur, so this
is not a mainline bug.  Howevere, there's a bunch of uses for such
beasts (e.g. the "try and evict everything we can out of given page"
patches; there are potential uses in mount-related code, considerably
simplifying the life in fs/namespace.c, etc.)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

diff --git a/fs/dcache.c b/fs/dcache.c
index 8136bda27a1f..43480f516329 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -860,6 +860,32 @@ void dput(struct dentry *dentry)
 }
 EXPORT_SYMBOL(dput);
 
+static void __dput_to_list(struct dentry *dentry, struct list_head *list)
+__must_hold(&dentry->d_lock)
+{
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/* let the owner of the list it's on deal with it */
+		--dentry->d_lockref.count;
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!--dentry->d_lockref.count)
+			d_shrink_add(dentry, list);
+	}
+}
+
+void dput_to_list(struct dentry *dentry, struct list_head *list)
+{
+	rcu_read_lock();
+	if (likely(fast_dput(dentry))) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	if (!retain_dentry(dentry))
+		__dput_to_list(dentry, list);
+	spin_unlock(&dentry->d_lock);
+}
 
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
@@ -1088,18 +1114,9 @@ static void shrink_dentry_list(struct list_head *list)
 		rcu_read_unlock();
 		d_shrink_del(dentry);
 		parent = dentry->d_parent;
+		if (parent != dentry)
+			__dput_to_list(parent, list);
 		__dentry_kill(dentry);
-		if (parent == dentry)
-			continue;
-		/*
-		 * We need to prune ancestors too. This is necessary to prevent
-		 * quadratic behavior of shrink_dcache_parent(), but is also
-		 * expected to be beneficial in reducing dentry cache
-		 * fragmentation.
-		 */
-		dentry = parent;
-		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry);
 	}
 }
 
@@ -1444,8 +1461,11 @@ int d_set_mounted(struct dentry *dentry)
 
 struct select_data {
 	struct dentry *start;
+	union {
+		long found;
+		struct dentry *victim;
+	};
 	struct list_head dispose;
-	int found;
 };
 
 static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
@@ -1477,6 +1497,37 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 	return ret;
 }
 
+static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
+{
+	struct select_data *data = _data;
+	enum d_walk_ret ret = D_WALK_CONTINUE;
+
+	if (data->start == dentry)
+		goto out;
+
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		if (!dentry->d_lockref.count) {
+			rcu_read_lock();
+			data->victim = dentry;
+			return D_WALK_QUIT;
+		}
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!dentry->d_lockref.count)
+			d_shrink_add(dentry, &data->dispose);
+	}
+	/*
+	 * We can return to the caller if we have found some (this
+	 * ensures forward progress). We'll be coming back to find
+	 * the rest.
+	 */
+	if (!list_empty(&data->dispose))
+		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
+out:
+	return ret;
+}
+
 /**
  * shrink_dcache_parent - prune dcache
  * @parent: parent of entries to prune
@@ -1486,12 +1537,9 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 void shrink_dcache_parent(struct dentry *parent)
 {
 	for (;;) {
-		struct select_data data;
+		struct select_data data = {.start = parent};
 
 		INIT_LIST_HEAD(&data.dispose);
-		data.start = parent;
-		data.found = 0;
-
 		d_walk(parent, &data, select_collect);
 
 		if (!list_empty(&data.dispose)) {
@@ -1502,6 +1550,21 @@ void shrink_dcache_parent(struct dentry *parent)
 		cond_resched();
 		if (!data.found)
 			break;
+		d_walk(parent, &data, select_collect2);
+		if (data.victim) {
+			struct dentry *parent;
+			if (!shrink_lock_dentry(data.victim)) {
+				rcu_read_unlock();
+			} else {
+				rcu_read_unlock();
+				parent = data.victim->d_parent;
+				if (parent != data.victim)
+					__dput_to_list(parent, &data.dispose);
+				__dentry_kill(data.victim);
+			}
+		}
+		if (!list_empty(&data.dispose))
+			shrink_dentry_list(&data.dispose);
 	}
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
Al Viro June 29, 2019, 10:34 p.m. UTC | #3
On Sat, Jun 29, 2019 at 11:29:45PM +0100, Al Viro wrote:

> Like this (again, only build-tested):
... and with obvious braino fixed,

Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists

Currently, running into a shrink list that contains dentries from different
filesystems can cause several unpleasant things for shrink_dcache_parent()
and for umount(2).

The first problem is that there's a window during shrink_dentry_list() between
__dentry_kill() takes a victim out and dropping reference to its parent.  During
that window the parent looks like a genuine busy dentry.  shrink_dcache_parent()
(or, worse yet, shrink_dcache_for_umount()) coming at that time will see no
eviction candidates and no indication that it needs to wait for some
shrink_dentry_list() to proceed further.

That applies for any shrink list that might intersect with the subtree we are
trying to shrink; the only reason it does not blow on umount(2) in the mainline
is that we unregister the memory shrinker before hitting shrink_dcache_for_umount().

Another problem happens if something in a mixed-filesystem shrink list gets
be stuck in e.g. iput(), getting umount of unrelated fs to spin waiting for
the stuck shrinker to get around to our dentries.

Solution:
    1) have shrink_dentry_list() decrement the parent's refcount and
make sure it's on a shrink list (ours unless it already had been on some
other) before calling __dentry_kill().  That eliminates the window when
shrink_dcache_parent() would've blown past the entire subtree without
noticing anything with zero refcount not on shrink lists.
    2) when shrink_dcache_parent() has found no eviction candidates,
but some dentries are still sitting on shrink lists, rather than
repeating the scan in hope that shrinkers have progressed, scan looking
for something on shrink lists with zero refcount.  If such a thing is
found, grab rcu_read_lock() and stop the scan, with caller locking
it for eviction, dropping out of RCU and doing __dentry_kill(), with
the same treatment for parent as shrink_dentry_list() would do.

Note that right now mixed-filesystem shrink lists do not occur, so this
is not a mainline bug.  Howevere, there's a bunch of uses for such
beasts (e.g. the "try and evict everything we can out of given page"
patches; there are potential uses in mount-related code, considerably
simplifying the life in fs/namespace.c, etc.)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

diff --git a/fs/dcache.c b/fs/dcache.c
index 8136bda27a1f..4b50e09ee950 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -860,6 +860,32 @@ void dput(struct dentry *dentry)
 }
 EXPORT_SYMBOL(dput);
 
+static void __dput_to_list(struct dentry *dentry, struct list_head *list)
+__must_hold(&dentry->d_lock)
+{
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/* let the owner of the list it's on deal with it */
+		--dentry->d_lockref.count;
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!--dentry->d_lockref.count)
+			d_shrink_add(dentry, list);
+	}
+}
+
+void dput_to_list(struct dentry *dentry, struct list_head *list)
+{
+	rcu_read_lock();
+	if (likely(fast_dput(dentry))) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	if (!retain_dentry(dentry))
+		__dput_to_list(dentry, list);
+	spin_unlock(&dentry->d_lock);
+}
 
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
@@ -1088,18 +1114,9 @@ static void shrink_dentry_list(struct list_head *list)
 		rcu_read_unlock();
 		d_shrink_del(dentry);
 		parent = dentry->d_parent;
+		if (parent != dentry)
+			__dput_to_list(parent, list);
 		__dentry_kill(dentry);
-		if (parent == dentry)
-			continue;
-		/*
-		 * We need to prune ancestors too. This is necessary to prevent
-		 * quadratic behavior of shrink_dcache_parent(), but is also
-		 * expected to be beneficial in reducing dentry cache
-		 * fragmentation.
-		 */
-		dentry = parent;
-		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry);
 	}
 }
 
@@ -1444,8 +1461,11 @@ int d_set_mounted(struct dentry *dentry)
 
 struct select_data {
 	struct dentry *start;
+	union {
+		long found;
+		struct dentry *victim;
+	};
 	struct list_head dispose;
-	int found;
 };
 
 static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
@@ -1477,6 +1497,37 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 	return ret;
 }
 
+static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
+{
+	struct select_data *data = _data;
+	enum d_walk_ret ret = D_WALK_CONTINUE;
+
+	if (data->start == dentry)
+		goto out;
+
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		if (!dentry->d_lockref.count) {
+			rcu_read_lock();
+			data->victim = dentry;
+			return D_WALK_QUIT;
+		}
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!dentry->d_lockref.count)
+			d_shrink_add(dentry, &data->dispose);
+	}
+	/*
+	 * We can return to the caller if we have found some (this
+	 * ensures forward progress). We'll be coming back to find
+	 * the rest.
+	 */
+	if (!list_empty(&data->dispose))
+		ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
+out:
+	return ret;
+}
+
 /**
  * shrink_dcache_parent - prune dcache
  * @parent: parent of entries to prune
@@ -1486,12 +1537,9 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 void shrink_dcache_parent(struct dentry *parent)
 {
 	for (;;) {
-		struct select_data data;
+		struct select_data data = {.start = parent};
 
 		INIT_LIST_HEAD(&data.dispose);
-		data.start = parent;
-		data.found = 0;
-
 		d_walk(parent, &data, select_collect);
 
 		if (!list_empty(&data.dispose)) {
@@ -1502,6 +1550,22 @@ void shrink_dcache_parent(struct dentry *parent)
 		cond_resched();
 		if (!data.found)
 			break;
+		data.victim = NULL;
+		d_walk(parent, &data, select_collect2);
+		if (data.victim) {
+			struct dentry *parent;
+			if (!shrink_lock_dentry(data.victim)) {
+				rcu_read_unlock();
+			} else {
+				rcu_read_unlock();
+				parent = data.victim->d_parent;
+				if (parent != data.victim)
+					__dput_to_list(parent, &data.dispose);
+				__dentry_kill(data.victim);
+			}
+		}
+		if (!list_empty(&data.dispose))
+			shrink_dentry_list(&data.dispose);
 	}
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
diff --git a/fs/internal.h b/fs/internal.h
index 0010889f2e85..68f132cf2664 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -160,6 +160,7 @@ extern int d_set_mounted(struct dentry *dentry);
 extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc);
 extern struct dentry *d_alloc_cursor(struct dentry *);
 extern struct dentry * d_alloc_pseudo(struct super_block *, const struct qstr *);
+extern void dput_to_list(struct dentry *, struct list_head *);
 
 /*
  * read_write.c
Tobin Harding July 1, 2019, 9:26 a.m. UTC | #4
On Sat, Jun 29, 2019 at 08:06:24PM +0100, Al Viro wrote:
> On Sat, Jun 29, 2019 at 05:38:03AM +0100, Al Viro wrote:
> 
> > PS: the problem is not gone in the next iteration of the patchset in
> > question.  The patch I'm proposing (including dput_to_list() and _ONLY_
> > compile-tested) follows.  Comments?
> 
> FWIW, there's another unpleasantness in the whole thing.  Suppose we have
> picked a page full of dentries, all with refcount 0.  We decide to
> evict all of them.  As it turns out, they are from two filesystems.
> Filesystem 1 is NFS on a server, with currently downed hub on the way
> to it.  Filesystem 2 is local.  We attempt to evict an NFS dentry and
> get stuck - tons of dirty data with no way to flush them on server.
> In the meanwhile, admin tries to unmount the local filesystem.  And
> gets stuck as well, since umount can't do anything to its dentries
> that happen to sit in our shrink list.
> 
> I wonder if the root of problem here isn't in shrink_dcache_for_umount();
> all it really needs is to have everything on that fs with refcount 0
> dragged through __dentry_kill().  If something had been on a shrink
> list, __dentry_kill() will just leave behind a struct dentry completely
> devoid of any connection to superblock, other dentries, filesystem
> type, etc. - it's just a piece of memory that won't be freed until
> the owner of shrink list finally gets around to it.  Which can happen
> at any point - all they'll do to it is dentry_free(), and that doesn't
> need any fs-related data structures.
> 
> The logics in shrink_dcache_parent() is
> 	collect everything evictable into a shrink list
> 	if anything found - kick it out and repeat the scan
> 	otherwise, if something had been on other's shrink list
> 		repeat the scan
> 
> I wonder if after the "no evictable candidates, but something
> on other's shrink lists" we ought to do something along the
> lines of
> 	rcu_read_lock
> 	walk it, doing
> 		if dentry has zero refcount
> 			if it's not on a shrink list,
> 				move it to ours
> 			else
> 				store its address in 'victim'
> 				end the walk
> 	if no victim found
> 		rcu_read_unlock
> 	else
> 		lock victim for __dentry_kill
> 		rcu_read_unlock
> 		if it's still alive
> 			if it's not IS_ROOT
> 				if parent is not on shrink list
> 					decrement parent's refcount
> 					put it on our list
> 				else
> 					decrement parent's refcount
> 			__dentry_kill(victim)
> 		else
> 			unlock
> 	if our list is non-empty
> 		shrink_dentry_list on it
> in there...

Thanks for still thinking about this Al.  I don't have a lot of idea
about what to do with your comments until I can grok them fully but I
wanted to acknowledge having read them.

Thanks,
Tobin.
diff mbox series

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index 8136bda27a1f..dfe21a649c96 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -860,6 +860,32 @@  void dput(struct dentry *dentry)
 }
 EXPORT_SYMBOL(dput);
 
+static void __dput_to_list(struct dentry *dentry, struct list_head *list)
+__must_hold(&dentry->d_lock)
+{
+	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/* let the owner of the list it's on deal with it */
+		--dentry->d_lockref.count;
+	} else {
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+		if (!--dentry->d_lockref.count)
+			d_shrink_add(dentry, list);
+	}
+}
+
+void dput_to_list(struct dentry *dentry, struct list_head *list)
+{
+	rcu_read_lock();
+	if (likely(fast_dput(dentry))) {
+		rcu_read_unlock();
+		return;
+	}
+	rcu_read_unlock();
+	if (!retain_dentry(dentry))
+		__dput_to_list(dentry, list);
+	spin_unlock(&dentry->d_lock);
+}
 
 /* This must be called with d_lock held */
 static inline void __dget_dlock(struct dentry *dentry)
@@ -1088,18 +1114,9 @@  static void shrink_dentry_list(struct list_head *list)
 		rcu_read_unlock();
 		d_shrink_del(dentry);
 		parent = dentry->d_parent;
+		if (parent != dentry)
+			__dput_to_list(parent, list);
 		__dentry_kill(dentry);
-		if (parent == dentry)
-			continue;
-		/*
-		 * We need to prune ancestors too. This is necessary to prevent
-		 * quadratic behavior of shrink_dcache_parent(), but is also
-		 * expected to be beneficial in reducing dentry cache
-		 * fragmentation.
-		 */
-		dentry = parent;
-		while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
-			dentry = dentry_kill(dentry);
 	}
 }