Message ID | 1611235185-1685-2-git-send-email-gautham.ananthakrishna@oracle.com
---|---
State | New, archived
Series | fix the negative dentries bloating system memory usage
On Thu, Jan 21, 2021 at 06:49:40PM +0530, Gautham Ananthakrishna wrote:
> From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
>
> For disk filesystems result of every negative lookup is cached, content of
> directories is usually cached too. Production of negative dentries isn't
> limited with disk speed. It's really easy to generate millions of them if
> system has enough memory. Negative dentries are linked into siblings list
> along with normal positive dentries. Some operations walks dcache tree but
> looks only for positive dentries: most important is fsnotify/inotify.
>
> This patch moves negative dentries to the end of list at final dput() and
> marks with flag which tells that all following dentries are negative too.
> Reverse operation is required before instantiating negative dentry.

> +static void sweep_negative(struct dentry *dentry)
> +{
> +        struct dentry *parent;
> +
> +        if (!d_is_tail_negative(dentry)) {
> +                parent = lock_parent(dentry);
> +                if (!parent)
> +                        return;
> +
> +                if (!d_count(dentry) && d_is_negative(dentry) &&
> +                    !d_is_tail_negative(dentry)) {
> +                        dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
> +                        list_move_tail(&dentry->d_child, &parent->d_subdirs);
> +                }
> +
> +                spin_unlock(&parent->d_lock);
> +        }
> +}

Ugh...  So when dput() drives the refcount down to 0 you hit lock_parent()
and only then bother to check if the sucker had been negative in the first
place?

> @@ -1970,6 +2021,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
>  {
>          BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
>          if (inode) {
> +                if (d_is_tail_negative(entry))
> +                        recycle_negative(entry);
>                  security_d_instantiate(entry, inode);
>                  spin_lock(&inode->i_lock);
>                  __d_instantiate(entry, inode);

Wait a bloody minute.  What about d_instantiate_new() right next to it?
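[For reference, a sketch of the hunk being asked about. This is hypothetical and NOT part of the posted series; it assumes the mainline body of d_instantiate_new() at the time and simply mirrors the d_instantiate() change into it.]

/*
 * Hypothetical, not in the posted series: d_instantiate_new() turns a
 * (possibly tail-negative) dentry positive just like d_instantiate(),
 * so it would seem to need the identical recycle_negative() call.
 */
void d_instantiate_new(struct dentry *entry, struct inode *inode)
{
        BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
        BUG_ON(!inode);
        lockdep_annotate_inode_mutex_key(inode);
        if (d_is_tail_negative(entry))  /* the added check, mirroring d_instantiate() */
                recycle_negative(entry);
        security_d_instantiate(entry, inode);
        spin_lock(&inode->i_lock);
        __d_instantiate(entry, inode);
        WRITE_ONCE(inode->i_state, inode->i_state & ~(I_NEW | I_CREATING));
        smp_mb();
        wake_up_bit(&inode->i_state, __I_NEW);
        spin_unlock(&inode->i_lock);
}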
On Thu, Jan 21, 2021 at 06:49:40PM +0530, Gautham Ananthakrishna wrote:

> +static void sweep_negative(struct dentry *dentry)
> +{
> +        struct dentry *parent;
> +
> +        if (!d_is_tail_negative(dentry)) {
> +                parent = lock_parent(dentry);
> +                if (!parent)
> +                        return;

Wait a minute.  It's not a good environment for calling lock_parent().
Who said that dentry won't get freed right under it?

Right now callers of __lock_parent() either hold a reference to dentry
*or* are called for a positive dentry, with inode->i_lock held.
You are introducing something very different -

>         if (likely(retain_dentry(dentry))) {
> +                if (d_is_negative(dentry))
> +                        sweep_negative(dentry);
>                 spin_unlock(&dentry->d_lock);

Here we can be called for a negative dentry with refcount already *NOT*
held by us.  Look:

static inline struct dentry *lock_parent(struct dentry *dentry)
{
        struct dentry *parent = dentry->d_parent;
        if (IS_ROOT(dentry))
                return NULL;

isn't a root

        if (likely(spin_trylock(&parent->d_lock)))
                return parent;

no such luck - someone's already holding parent's ->d_lock

        return __lock_parent(dentry);

and here we have

static struct dentry *__lock_parent(struct dentry *dentry)
{
        struct dentry *parent;
        rcu_read_lock();

OK, anything we see in its ->d_parent is guaranteed to stay allocated
until we get to matching rcu_read_unlock()

        spin_unlock(&dentry->d_lock);

dropped the spinlock, now it's fair game for d_move(), d_drop(), etc.

again:
        parent = READ_ONCE(dentry->d_parent);

dentry couldn't have been reused, so it's the last value stored there.
Points to still allocated struct dentry instance, so we can...

        spin_lock(&parent->d_lock);

grab its ->d_lock.

        /*
         * We can't blindly lock dentry until we are sure
         * that we won't violate the locking order.
         * Any changes of dentry->d_parent must have
         * been done with parent->d_lock held, so
         * spin_lock() above is enough of a barrier
         * for checking if it's still our child.
         */
        if (unlikely(parent != dentry->d_parent)) {
                spin_unlock(&parent->d_lock);
                goto again;
        }

Nevermind, it's still equal to our ->d_parent.  So we have the last
valid parent's ->d_lock held

        rcu_read_unlock();

What's to hold dentry allocated now?  IF we held its refcount - no
problem, it can't go away.  If we held its ->d_inode->i_lock - ditto
(it wouldn't get to __dentry_kill() until we drop that, since all
callers do acquire that lock and it couldn't get scheduled for freeing
until it gets through most of __dentry_kill()).  IOW, we are free to
grab dentry->d_lock again.

        if (parent != dentry)
                spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
        else
                parent = NULL;
        return parent;
}

With your patch, though, you've got a call site where neither condition
is guaranteed.  Current kernel is fine - we are holding ->d_lock there,
and we don't touch dentry after it gets dropped.  Again, it can't get
scheduled for freeing until after we drop ->d_lock, so we are safe.
With that change, however, you've got a hard-to-hit memory corruptor
there...
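[Spelled out as a timeline, the window being described looks roughly like this. The interleaving is illustrative, not taken from the thread:]

/*
 * Illustrative interleaving of the hazard above:
 *
 *   CPU0                                     CPU1
 *   dput(dentry)
 *     retain_dentry() -> true                (refcount now 0, dentry hashed)
 *     sweep_negative(dentry)
 *       lock_parent()
 *         __lock_parent()
 *           rcu_read_lock()
 *           spin_unlock(&dentry->d_lock)
 *                                            lookup grabs dentry, then dput()
 *                                            -> dentry_kill() -> __dentry_kill()
 *                                            dentry queued for RCU freeing
 *           spin_lock(&parent->d_lock)
 *           rcu_read_unlock()
 *           (grace period may end here; dentry memory freed)
 *           spin_lock_nested(&dentry->d_lock, ...)   <-- use-after-free
 */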
On Wed, Apr 14, 2021 at 03:41:10AM +0000, Al Viro wrote:

> > +        if (!d_is_tail_negative(dentry)) {
> > +                parent = lock_parent(dentry);
> > +                if (!parent)
> > +                        return;
>
> Wait a minute.  It's not a good environment for calling lock_parent().
> Who said that dentry won't get freed right under it?

[snip]

FWIW, in that state (dentry->d_lock held) we have
        * stable ->d_flags
        * stable ->d_count
        * stable ->d_inode

IOW, we can bloody well check ->d_count *before* bothering with
lock_parent().  It does not get rid of the problems with lifetimes,
though.  We could do something along the lines of

        rcu_read_lock()
        if retain_dentry()
                parent = NULL
                if that dentry might need to be moved in list
                        parent = lock_parent()
                        // if reached __dentry_kill(), d_count() will be -128,
                        // so the check below will exclude those
                if that dentry does need to be moved
                        move it to the end of list
                unlock dentry and parent (if any)
                rcu_read_unlock()
                return

here, but your other uses of lock_parent() also need attention.  And
recursive call of dput() in trim_negative() (#6/6) is really asking
for trouble.
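[In C, that outline might come out roughly as below. A sketch only, not a tested patch: it assumes the caller (dput()) enters with dentry->d_lock held and the refcount already dropped to zero by retain_dentry(), and relies on Al's point that a dentry which reached __dentry_kill() reads d_count() == -128.]

/*
 * Sketch of the outline above.  The rcu_read_lock() keeps the dentry
 * allocated while lock_parent() transiently drops ->d_lock; the recheck
 * after relocking excludes dentries that went through __dentry_kill()
 * in the window (their d_count() reads -128, i.e. nonzero).
 * Both locks are released before rcu_read_unlock(), per the outline.
 */
static void sweep_negative(struct dentry *dentry)
        __releases(dentry->d_lock)
{
        struct dentry *parent = NULL;

        rcu_read_lock();
        if (!d_count(dentry) && d_is_negative(dentry) &&
            !d_is_tail_negative(dentry))
                parent = lock_parent(dentry);   /* may drop and retake ->d_lock */

        /* recheck: everything may have changed while ->d_lock was dropped */
        if (parent && !d_count(dentry) && d_is_negative(dentry) &&
            !d_is_tail_negative(dentry)) {
                dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
                list_move_tail(&dentry->d_child, &parent->d_subdirs);
        }
        spin_unlock(&dentry->d_lock);
        if (parent)
                spin_unlock(&parent->d_lock);
        rcu_read_unlock();      /* only now may the dentry be freed */
}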
On Wed, Apr 14, 2021 at 03:00:48AM +0000, Al Viro wrote:

> Ugh...  So when dput() drives the refcount down to 0 you hit lock_parent()
> and only then bother to check if the sucker had been negative in the first
                                              ^^^^^^^^^^^^^^^^^
had zero refcount, of course.

> place?

> > @@ -1970,6 +2021,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
> >  {
> >          BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
> >          if (inode) {
> > +                if (d_is_tail_negative(entry))
> > +                        recycle_negative(entry);
> >                  security_d_instantiate(entry, inode);
> >                  spin_lock(&inode->i_lock);
> >                  __d_instantiate(entry, inode);
>
> Wait a bloody minute.  What about d_instantiate_new() right next to it?

Another fun question: where's the proof that __d_add(dentry, non_NULL_inode)
won't happen to dentry marked tail-negative?  From a quick grep I see
at least one such place - on success cifs_do_create() does
        d_drop(direntry);
        d_add(direntry, newinode);
and it would bloody well evade what you are doing in d_instantiate().
Same seems to be true for nfs_link()...
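[One conceivable way to close that hole would be to do the recycle in __d_add() itself, which d_add() and d_splice_alias() both funnel through. Hypothetical only, not something the thread settles on; note that recycle_negative() as posted takes ->d_lock on its own, hence the placement before the lock is acquired:]

/*
 * Hypothetical placement, NOT in the posted series: hooking __d_add()
 * would cover d_add() callers such as cifs_do_create() and nfs_link()
 * that never go through d_instantiate().  recycle_negative() takes
 * ->d_lock itself, so it must run before __d_add() grabs that lock.
 */
static inline void __d_add(struct dentry *dentry, struct inode *inode)
{
        if (inode && d_is_tail_negative(dentry))
                recycle_negative(dentry);

        spin_lock(&dentry->d_lock);
        /* ... rest of __d_add() unchanged ... */
}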
diff --git a/fs/dcache.c b/fs/dcache.c
index ea04858..a506169 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -632,6 +632,48 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
 	return __lock_parent(dentry);
 }
 
+/*
+ * Move cached negative dentry to the tail of parent->d_subdirs.
+ * This lets walkers skip them all together at first sight.
+ * Must be called at dput of negative dentry.
+ */
+static void sweep_negative(struct dentry *dentry)
+{
+	struct dentry *parent;
+
+	if (!d_is_tail_negative(dentry)) {
+		parent = lock_parent(dentry);
+		if (!parent)
+			return;
+
+		if (!d_count(dentry) && d_is_negative(dentry) &&
+		    !d_is_tail_negative(dentry)) {
+			dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
+			list_move_tail(&dentry->d_child, &parent->d_subdirs);
+		}
+
+		spin_unlock(&parent->d_lock);
+	}
+}
+
+/*
+ * Undo sweep_negative() and move to the head of parent->d_subdirs.
+ * Must be called before converting negative dentry into positive.
+ */
+static void recycle_negative(struct dentry *dentry)
+{
+	struct dentry *parent;
+
+	spin_lock(&dentry->d_lock);
+	parent = lock_parent(dentry);
+	dentry->d_flags &= ~DCACHE_TAIL_NEGATIVE;
+	if (parent) {
+		list_move(&dentry->d_child, &parent->d_subdirs);
+		spin_unlock(&parent->d_lock);
+	}
+	spin_unlock(&dentry->d_lock);
+}
+
 static inline bool retain_dentry(struct dentry *dentry)
 {
 	WARN_ON(d_in_lookup(dentry));
@@ -737,7 +779,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
 static inline bool fast_dput(struct dentry *dentry)
 {
 	int ret;
-	unsigned int d_flags;
+	unsigned int d_flags, required;
 
 	/*
 	 * If we have a d_op->d_delete() operation, we sould not
@@ -785,6 +827,8 @@ static inline bool fast_dput(struct dentry *dentry)
 	 * a 'delete' op, and it's referenced and already on
 	 * the LRU list.
 	 *
+	 * Cached negative dentry must be swept to the tail.
+	 *
 	 * NOTE! Since we aren't locked, these values are
 	 * not "stable". However, it is sufficient that at
 	 * some point after we dropped the reference the
@@ -796,10 +840,15 @@ static inline bool fast_dput(struct dentry *dentry)
 	 */
 	smp_rmb();
 	d_flags = READ_ONCE(dentry->d_flags);
-	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED;
+
+	required = DCACHE_REFERENCED | DCACHE_LRU_LIST |
+		(d_flags_negative(d_flags) ? DCACHE_TAIL_NEGATIVE : 0);
+
+	d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
+			DCACHE_DISCONNECTED | DCACHE_TAIL_NEGATIVE;
 
 	/* Nothing to do? Dropping the reference was all we needed? */
-	if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
+	if (d_flags == required && !d_unhashed(dentry))
 		return true;
 
 	/*
@@ -871,6 +920,8 @@ void dput(struct dentry *dentry)
 		rcu_read_unlock();
 
 		if (likely(retain_dentry(dentry))) {
+			if (d_is_negative(dentry))
+				sweep_negative(dentry);
 			spin_unlock(&dentry->d_lock);
 			return;
 		}
@@ -1970,6 +2021,8 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
 	if (inode) {
+		if (d_is_tail_negative(entry))
+			recycle_negative(entry);
 		security_d_instantiate(entry, inode);
 		spin_lock(&inode->i_lock);
 		__d_instantiate(entry, inode);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6f95c33..5f4ce3a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -219,6 +219,7 @@ struct dentry_operations {
 #define DCACHE_PAR_LOOKUP	0x10000000 /* being looked up (with parent locked shared) */
 #define DCACHE_DENTRY_CURSOR	0x20000000
 #define DCACHE_NORCU		0x40000000 /* No RCU delay for freeing */
+#define DCACHE_TAIL_NEGATIVE	0x80000000 /* All following siblings are negative */
 
 extern seqlock_t rename_lock;
 
@@ -495,6 +496,11 @@ static inline int simple_positive(const struct dentry *dentry)
 	return d_really_is_positive(dentry) && !d_unhashed(dentry);
 }
 
+static inline bool d_is_tail_negative(const struct dentry *dentry)
+{
+	return unlikely(dentry->d_flags & DCACHE_TAIL_NEGATIVE);
+}
+
 extern void d_set_fallthru(struct dentry *dentry);
 
 static inline bool d_is_fallthru(const struct dentry *dentry)