[RFC,1/6] dcache: sweep cached negative dentries to the end of list of siblings

Message ID 1611235185-1685-2-git-send-email-gautham.ananthakrishna@oracle.com (mailing list archive)
State New, archived
Series fix the negative dentries bloating system memory usage

Commit Message

Gautham Ananthakrishna Jan. 21, 2021, 1:19 p.m. UTC
From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

For disk filesystems, the result of every negative lookup is cached, and the
contents of directories are usually cached too. Production of negative
dentries isn't limited by disk speed, so it's really easy to generate
millions of them if the system has enough memory. Negative dentries are
linked into the siblings list along with normal positive dentries. Some
operations walk the dcache tree but look only for positive dentries; the
most important is fsnotify/inotify.

This patch moves negative dentries to the end of the siblings list at final
dput() and marks them with a flag which tells that all following dentries
are negative too. The reverse operation is required before instantiating a
negative dentry.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
---
 fs/dcache.c            | 59 +++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/dcache.h |  6 +++++
 2 files changed, 62 insertions(+), 3 deletions(-)
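
For illustration, a d_subdirs walker that honors the new flag might look
like the sketch below. This is hypothetical: the patch itself only
introduces the marking, and process_positive_child() is a made-up callback.

static void walk_positive_children(struct dentry *parent)
{
	struct dentry *child;

	spin_lock(&parent->d_lock);
	list_for_each_entry(child, &parent->d_subdirs, d_child) {
		/* All siblings past the first tail-negative one are negative. */
		if (d_is_tail_negative(child))
			break;
		if (d_is_positive(child))
			process_positive_child(child);	/* hypothetical callback */
	}
	spin_unlock(&parent->d_lock);
}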

Comments

Al Viro April 14, 2021, 3 a.m. UTC | #1
On Thu, Jan 21, 2021 at 06:49:40PM +0530, Gautham Ananthakrishna wrote:
> From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> 
> For disk filesystems, the result of every negative lookup is cached, and the
> contents of directories are usually cached too. Production of negative
> dentries isn't limited by disk speed, so it's really easy to generate
> millions of them if the system has enough memory. Negative dentries are
> linked into the siblings list along with normal positive dentries. Some
> operations walk the dcache tree but look only for positive dentries; the
> most important is fsnotify/inotify.
> 
> This patch moves negative dentries to the end of the siblings list at final
> dput() and marks them with a flag which tells that all following dentries
> are negative too. The reverse operation is required before instantiating a
> negative dentry.

> +static void sweep_negative(struct dentry *dentry)
> +{
> +	struct dentry *parent;
> +
> +	if (!d_is_tail_negative(dentry)) {
> +		parent = lock_parent(dentry);
> +		if (!parent)
> +			return;
> +
> +		if (!d_count(dentry) && d_is_negative(dentry) &&
> +		    !d_is_tail_negative(dentry)) {
> +			dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
> +			list_move_tail(&dentry->d_child, &parent->d_subdirs);
> +		}
> +
> +		spin_unlock(&parent->d_lock);
> +	}
> +}

Ugh...  So when dput() drives the refcount down to 0 you hit lock_parent()
and only then bother to check if the sucker had been negative in the first
place?

> @@ -1970,6 +2021,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
>  {
>  	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
>  	if (inode) {
> +		if (d_is_tail_negative(entry))
> +			recycle_negative(entry);
>  		security_d_instantiate(entry, inode);
>  		spin_lock(&inode->i_lock);
>  		__d_instantiate(entry, inode);

Wait a bloody minute.  What about d_instantiate_new() right next to it?
Al Viro April 14, 2021, 3:41 a.m. UTC | #2
On Thu, Jan 21, 2021 at 06:49:40PM +0530, Gautham Ananthakrishna wrote:

> +static void sweep_negative(struct dentry *dentry)
> +{
> +	struct dentry *parent;
> +
> +	if (!d_is_tail_negative(dentry)) {
> +		parent = lock_parent(dentry);
> +		if (!parent)
> +			return;

Wait a minute.  It's not a good environment for calling lock_parent().
Who said that dentry won't get freed right under it?

Right now callers of __lock_parent() either hold a reference to dentry
*or* are called for a positive dentry, with inode->i_lock held.
You are introducing something very different - 

>  		if (likely(retain_dentry(dentry))) {
> +			if (d_is_negative(dentry))
> +				sweep_negative(dentry);
>  			spin_unlock(&dentry->d_lock);

Here we can be called for a negative dentry with refcount already *NOT*
held by us.  Look:

static inline struct dentry *lock_parent(struct dentry *dentry)
{
        struct dentry *parent = dentry->d_parent;
	if (IS_ROOT(dentry))
		return NULL;
isn't a root

	if (likely(spin_trylock(&parent->d_lock)))
		return parent;

no such luck - someone's already holding parent's ->d_lock

	return __lock_parent(dentry);
and here we have
static struct dentry *__lock_parent(struct dentry *dentry)
{
	struct dentry *parent;
	rcu_read_lock();  

OK, anything we see in its ->d_parent is guaranteed to stay
allocated until we get to matching rcu_read_unlock()

	spin_unlock(&dentry->d_lock);
dropped the spinlock, now it's fair game for d_move(), d_drop(), etc.

again:
	parent = READ_ONCE(dentry->d_parent);
dentry couldn't have been reused, so it's the last value stored there.
Points to still allocated struct dentry instance, so we can...

	spin_lock(&parent->d_lock);
grab its ->d_lock.

	/*
	 * We can't blindly lock dentry until we are sure
	 * that we won't violate the locking order.
	 * Any changes of dentry->d_parent must have
	 * been done with parent->d_lock held, so
	 * spin_lock() above is enough of a barrier
	 * for checking if it's still our child.
	 */
	if (unlikely(parent != dentry->d_parent)) {
		spin_unlock(&parent->d_lock);
		goto again;
	}
Nevermind, it's still equal to our ->d_parent.  So we have
the last valid parent's ->d_lock held

	rcu_read_unlock();
What's to hold dentry allocated now?  IF we held its refcount - no
problem, it can't go away.  If we held its ->d_inode->i_lock - ditto
(it wouldn't get to __dentry_kill() until we drop that, since all
callers do acquire that lock and it couldn't get scheduled for
freeing until it gets through most of __dentry_kill()).

IOW, we are free to grab dentry->d_lock again.
	if (parent != dentry)
		spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
	else
		parent = NULL;
	return parent;
}

With your patch, though, you've got a call site where neither condition
is guaranteed.  Current kernel is fine - we are holding ->d_lock there,
and we don't touch dentry after it gets dropped.  Again, it can't get
scheduled for freeing until after we drop ->d_lock, so we are safe.
With that change, however, you've got a hard-to-hit memory corruptor
there...
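
To make the race concrete, here is one possible interleaving (an
illustrative reconstruction of the scenario described above, not taken
from the thread):

/*
 *   CPU 0 (dput, refcount -> 0)           CPU 1 (e.g. the LRU shrinker)
 *   ---------------------------           -----------------------------
 *   sweep_negative()
 *     lock_parent() -> __lock_parent()
 *       rcu_read_lock()
 *       spin_unlock(&dentry->d_lock)
 *                                         sees d_count == 0, takes
 *                                         dentry->d_lock + parent->d_lock,
 *                                         __dentry_kill() schedules the
 *                                         RCU-delayed free, drops locks
 *       spin_lock(&parent->d_lock)
 *       rcu_read_unlock()
 *                                         grace period ends, dentry freed
 *       spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED)
 *         <-- touches freed memory
 */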
Al Viro April 15, 2021, 4:25 p.m. UTC | #3
On Wed, Apr 14, 2021 at 03:41:10AM +0000, Al Viro wrote:

> > +	if (!d_is_tail_negative(dentry)) {
> > +		parent = lock_parent(dentry);
> > +		if (!parent)
> > +			return;
> 
> Wait a minute.  It's not a good environment for calling lock_parent().
> Who said that dentry won't get freed right under it?

[snip]

FWIW, in that state (dentry->d_lock held) we have
	* stable ->d_flags
	* stable ->d_count
	* stable ->d_inode
IOW, we can bloody well check ->d_count *before* bothering with lock_parent().
It does not get rid of the problems with lifetimes, though.  We could
do something along the lines of

	rcu_read_lock()
	if retain_dentry()
		parent = NULL
		if that dentry might need to be moved in list
			parent = lock_parent()
			// if reached __dentry_kill(), d_count() will be -128,
			// so the check below will exclude those
			if that dentry does need to be moved
				move it to the end of list
		unlock dentry and parent (if any)
		rcu_read_unlock()
		return
here, but your other uses of lock_parent() also need attention.  And
recursive call of dput() in trim_negative() (#6/6) is really asking
for trouble.
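
Rendered as C, that sketch might look as follows (hypothetical: the helper
name is invented, and the exact shape of the dput() fast path around it is
not settled here):

/*
 * Called with dentry->d_lock held and our reference already dropped.
 * Returns true if the dentry is retained (and swept to the tail if it
 * is a cached negative one); all locks are released in that case.
 * Returns false with dentry->d_lock still held, so the caller can
 * proceed to dentry_kill().
 */
static bool retain_and_sweep(struct dentry *dentry)
{
	struct dentry *parent = NULL;

	rcu_read_lock();	/* keeps the dentry allocated across lock_parent() */
	if (!retain_dentry(dentry)) {
		rcu_read_unlock();
		return false;
	}
	if (d_is_negative(dentry) && !d_is_tail_negative(dentry))
		parent = lock_parent(dentry);
	/*
	 * lock_parent() may have dropped and retaken dentry->d_lock.
	 * A dentry that went through __dentry_kill() in that window has
	 * d_count() == -128, so the !d_count() recheck excludes it.
	 */
	if (parent && !d_count(dentry) && d_is_negative(dentry) &&
	    !d_is_tail_negative(dentry)) {
		dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
		list_move_tail(&dentry->d_child, &parent->d_subdirs);
	}
	spin_unlock(&dentry->d_lock);
	if (parent)
		spin_unlock(&parent->d_lock);
	rcu_read_unlock();
	return true;
}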
Al Viro April 15, 2021, 4:50 p.m. UTC | #4
On Wed, Apr 14, 2021 at 03:00:48AM +0000, Al Viro wrote:

> Ugh...  So when dput() drives the refcount down to 0 you hit lock_parent()
> and only then bother to check if the sucker had been negative in the first
                                              ^^^^^^^^^^^^^^^^^
                                              had zero refcount, of course.
> place?

> > @@ -1970,6 +2021,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
> >  {
> >  	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
> >  	if (inode) {
> > +		if (d_is_tail_negative(entry))
> > +			recycle_negative(entry);
> >  		security_d_instantiate(entry, inode);
> >  		spin_lock(&inode->i_lock);
> >  		__d_instantiate(entry, inode);
> 
> Wait a bloody minute.  What about d_instantiate_new() right next to it?

Another fun question: where's the proof that __d_add(dentry, non_NULL_inode)
won't happen to dentry marked tail-negative?  From a quick grep I see at
least one such place - on success cifs_do_create() does
	d_drop(direntry);
	d_add(direntry, newinode);
and it would bloody well evade what you are doing in d_instantiate().
Same seems to be true for nfs_link()...
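
A minimal sketch of the kind of coverage being asked for (a hypothetical
helper; an actual fix would have to be applied in d_instantiate_new(),
__d_add() and every other path that makes a negative dentry positive):

static inline void unmark_tail_negative(struct dentry *dentry)
{
	if (d_is_tail_negative(dentry))
		recycle_negative(dentry);
}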

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index ea04858..a506169 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -632,6 +632,48 @@  static inline struct dentry *lock_parent(struct dentry *dentry)
 	return __lock_parent(dentry);
 }
 
+/*
+ * Move cached negative dentry to the tail of parent->d_subdirs.
+ * This lets walkers skip them altogether at first sight.
+ * Must be called at dput of negative dentry.
+ */
+static void sweep_negative(struct dentry *dentry)
+{
+	struct dentry *parent;
+
+	if (!d_is_tail_negative(dentry)) {
+		parent = lock_parent(dentry);
+		if (!parent)
+			return;
+
+		if (!d_count(dentry) && d_is_negative(dentry) &&
+		    !d_is_tail_negative(dentry)) {
+			dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
+			list_move_tail(&dentry->d_child, &parent->d_subdirs);
+		}
+
+		spin_unlock(&parent->d_lock);
+	}
+}
+
+/*
+ * Undo sweep_negative() and move to the head of parent->d_subdirs.
+ * Must be called before converting negative dentry into positive.
+ */
+static void recycle_negative(struct dentry *dentry)
+{
+	struct dentry *parent;
+
+	spin_lock(&dentry->d_lock);
+	parent = lock_parent(dentry);
+	dentry->d_flags &= ~DCACHE_TAIL_NEGATIVE;
+	if (parent) {
+		list_move(&dentry->d_child, &parent->d_subdirs);
+		spin_unlock(&parent->d_lock);
+	}
+	spin_unlock(&dentry->d_lock);
+}
+
 static inline bool retain_dentry(struct dentry *dentry)
 {
 	WARN_ON(d_in_lookup(dentry));
@@ -737,7 +779,7 @@  static struct dentry *dentry_kill(struct dentry *dentry)
 static inline bool fast_dput(struct dentry *dentry)
 {
 	int ret;
-	unsigned int d_flags;
+	unsigned int d_flags, required;
 
 	/*
 	 * If we have a d_op->d_delete() operation, we should not
@@ -785,6 +827,8 @@  static inline bool fast_dput(struct dentry *dentry)
 	 * a 'delete' op, and it's referenced and already on
 	 * the LRU list.
 	 *
+	 * Cached negative dentry must be swept to the tail.
+	 *
 	 * NOTE! Since we aren't locked, these values are
 	 * not "stable". However, it is sufficient that at
 	 * some point after we dropped the reference the
@@ -796,10 +840,15 @@  static inline bool fast_dput(struct dentry *dentry)
 	 */
 	smp_rmb();
 	d_flags = READ_ONCE(dentry->d_flags);
-	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
+
+	required = DCACHE_REFERENCED | DCACHE_LRU_LIST |
+		(d_flags_negative(d_flags) ? DCACHE_TAIL_NEGATIVE : 0);
+
+	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
+		DCACHE_DISCONNECTED | DCACHE_TAIL_NEGATIVE;
 
 	/* Nothing to do? Dropping the reference was all we needed? */
-	if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
+	if (d_flags == required && !d_unhashed(dentry))
 		return true;
 
 	/*
@@ -871,6 +920,8 @@  void dput(struct dentry *dentry)
 		rcu_read_unlock();
 
 		if (likely(retain_dentry(dentry))) {
+			if (d_is_negative(dentry))
+				sweep_negative(dentry);
 			spin_unlock(&dentry->d_lock);
 			return;
 		}
@@ -1970,6 +2021,8 @@  void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
 	if (inode) {
+		if (d_is_tail_negative(entry))
+			recycle_negative(entry);
 		security_d_instantiate(entry, inode);
 		spin_lock(&inode->i_lock);
 		__d_instantiate(entry, inode);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6f95c33..5f4ce3a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -219,6 +219,7 @@  struct dentry_operations {
 #define DCACHE_PAR_LOOKUP		0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR		0x20000000
 #define DCACHE_NORCU			0x40000000 /* No RCU delay for freeing */
+#define DCACHE_TAIL_NEGATIVE		0x80000000 /* All following siblings are negative */
 
 extern seqlock_t rename_lock;
 
@@ -495,6 +496,11 @@  static inline int simple_positive(const struct dentry *dentry)
 	return d_really_is_positive(dentry) && !d_unhashed(dentry);
 }
 
+static inline bool d_is_tail_negative(const struct dentry *dentry)
+{
+	return unlikely(dentry->d_flags & DCACHE_TAIL_NEGATIVE);
+}
+
 extern void d_set_fallthru(struct dentry *dentry);
 
 static inline bool d_is_fallthru(const struct dentry *dentry)