[RFC,v3,14/15] dcache: Implement partial shrink via Slab Movable Objects
diff mbox series

Message ID 20190411013441.5415-15-tobin@kernel.org
State New
Headers show
Series
  • Slab Movable Objects (SMO)
Related show

Commit Message

Tobin C. Harding April 11, 2019, 1:34 a.m. UTC
The dentry slab cache is susceptible to internal fragmentation.  Now
that we have Slab Movable Objects we can attempt to defragment the
dcache.  Dentry objects are inherently _not_ relocatable however under
some conditions they can be free'd.  This is the same as shrinking the
dcache but instead of shrinking the whole cache we only attempt to free
those objects that are located in partially full slab pages.  There is
no guarantee that this will reduce the memory usage of the system, it is
a compromise between fragmented memory and total cache shrinkage with
the hope that some memory pressure can be alleviated.

This is implemented using the newly added Slab Movable Objects
infrastructure.  The dcache 'migration' function is intentionally _not_
called 'd_migrate' because we only free, we do not migrate.  Call it
'd_partial_shrink' to make explicit that no reallocation is done.

Implement isolate and 'migrate' functions for the dentry slab cache.

Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 fs/dcache.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

Comments

Al Viro April 11, 2019, 2:33 a.m. UTC | #1
On Thu, Apr 11, 2019 at 11:34:40AM +1000, Tobin C. Harding wrote:
> +/*
> + * d_isolate() - Dentry isolation callback function.
> + * @s: The dentry cache.
> + * @v: Vector of pointers to the objects to isolate.
> + * @nr: Number of objects in @v.
> + *
> + * The slab allocator is holding off frees. We can safely examine
> + * the object without the danger of it vanishing from under us.
> + */
> +static void *d_isolate(struct kmem_cache *s, void **v, int nr)
> +{
> +	struct dentry *dentry;
> +	int i;
> +
> +	for (i = 0; i < nr; i++) {
> +		dentry = v[i];
> +		__dget(dentry);
> +	}
> +
> +	return NULL;		/* No need for private data */
> +}

Huh?  This is compeletely wrong; what you need is collecting the ones
with zero refcount (and not on shrink lists) into a private list.
*NOT* bumping the refcounts at all.  And do it in your isolate thing.

> +static void d_partial_shrink(struct kmem_cache *s, void **v, int nr,
> +		      int node, void *_unused)
> +{
> +	struct dentry *dentry;
> +	LIST_HEAD(dispose);
> +	int i;
> +
> +	for (i = 0; i < nr; i++) {
> +		dentry = v[i];
> +		spin_lock(&dentry->d_lock);
> +		dentry->d_lockref.count--;
> +
> +		if (dentry->d_lockref.count > 0 ||
> +		    dentry->d_flags & DCACHE_SHRINK_LIST) {
> +			spin_unlock(&dentry->d_lock);
> +			continue;
> +		}
> +
> +		if (dentry->d_flags & DCACHE_LRU_LIST)
> +			d_lru_del(dentry);
> +
> +		d_shrink_add(dentry, &dispose);
> +
> +		spin_unlock(&dentry->d_lock);
> +	}

Basically, that loop (sans jerking the refcount up and down) should
get moved into d_isolate().
> +
> +	if (!list_empty(&dispose))
> +		shrink_dentry_list(&dispose);
> +}

... with this left in d_partial_shrink().  And you obviously need some way
to pass the list from the former to the latter...
Tobin C. Harding April 11, 2019, 2:48 a.m. UTC | #2
On Thu, Apr 11, 2019 at 03:33:22AM +0100, Al Viro wrote:
> On Thu, Apr 11, 2019 at 11:34:40AM +1000, Tobin C. Harding wrote:
> > +/*
> > + * d_isolate() - Dentry isolation callback function.
> > + * @s: The dentry cache.
> > + * @v: Vector of pointers to the objects to isolate.
> > + * @nr: Number of objects in @v.
> > + *
> > + * The slab allocator is holding off frees. We can safely examine
> > + * the object without the danger of it vanishing from under us.
> > + */
> > +static void *d_isolate(struct kmem_cache *s, void **v, int nr)
> > +{
> > +	struct dentry *dentry;
> > +	int i;
> > +
> > +	for (i = 0; i < nr; i++) {
> > +		dentry = v[i];
> > +		__dget(dentry);
> > +	}
> > +
> > +	return NULL;		/* No need for private data */
> > +}
> 
> Huh?  This is compeletely wrong; what you need is collecting the ones
> with zero refcount (and not on shrink lists) into a private list.
> *NOT* bumping the refcounts at all.  And do it in your isolate thing.

Oh, so putting entries on a shrink list is enough to pin them?

> 
> > +static void d_partial_shrink(struct kmem_cache *s, void **v, int nr,
> > +		      int node, void *_unused)
> > +{
> > +	struct dentry *dentry;
> > +	LIST_HEAD(dispose);
> > +	int i;
> > +
> > +	for (i = 0; i < nr; i++) {
> > +		dentry = v[i];
> > +		spin_lock(&dentry->d_lock);
> > +		dentry->d_lockref.count--;
> > +
> > +		if (dentry->d_lockref.count > 0 ||
> > +		    dentry->d_flags & DCACHE_SHRINK_LIST) {
> > +			spin_unlock(&dentry->d_lock);
> > +			continue;
> > +		}
> > +
> > +		if (dentry->d_flags & DCACHE_LRU_LIST)
> > +			d_lru_del(dentry);
> > +
> > +		d_shrink_add(dentry, &dispose);
> > +
> > +		spin_unlock(&dentry->d_lock);
> > +	}
> 
> Basically, that loop (sans jerking the refcount up and down) should
> get moved into d_isolate().
> > +
> > +	if (!list_empty(&dispose))
> > +		shrink_dentry_list(&dispose);
> > +}
> 
> ... with this left in d_partial_shrink().  And you obviously need some way
> to pass the list from the former to the latter...

Easy enough, we have a void * return value from the isolate function
just for this purpose.

Thanks Al, hackety hack ...


	Tobin
Al Viro April 11, 2019, 4:47 a.m. UTC | #3
On Thu, Apr 11, 2019 at 12:48:21PM +1000, Tobin C. Harding wrote:

> Oh, so putting entries on a shrink list is enough to pin them?

Not exactly pin, but __dentry_kill() has this:
        if (dentry->d_flags & DCACHE_SHRINK_LIST) {
                dentry->d_flags |= DCACHE_MAY_FREE;
                can_free = false;
        }
        spin_unlock(&dentry->d_lock);
        if (likely(can_free))
                dentry_free(dentry);
and shrink_dentry_list() - this:
                        if (dentry->d_lockref.count < 0)
                                can_free = dentry->d_flags & DCACHE_MAY_FREE;
                        spin_unlock(&dentry->d_lock);
                        if (can_free)
                                dentry_free(dentry);
			continue;
so if dentry destruction comes before we get around to
shrink_dentry_list(), it'll stop short of dentry_free() and mark it for
shrink_dentry_list() to do just dentry_free(); if it overlaps with
shrink_dentry_list(), but doesn't progress all the way to freeing,
we will
	* have dentry removed from shrink list
	* notice the negative ->d_count (i.e. that it has already reached
__dentry_kill())
	* see that __dentry_kill() is not through with tearing the sucker
apart (no DCACHE_MAY_FREE set)
... and just leave it alone, letting __dentry_kill() do the rest of its
thing - it's already off the shrink list, so __dentry_kill() will do
everything, including dentry_free().

The reason for that dance is the locking - shrink list belongs to whoever
has set it up and nobody else is modifying it.  So __dentry_kill() doesn't
even try to remove the victim from there; it does all the teardown
(detaches from inode, unhashes, etc.) and leaves removal from the shrink
list and actual freeing to the owner of shrink list.  That way we don't
have to protect all shrink lists a single lock (contention on it would
be painful) and we don't have to play with per-shrink-list locks and
all the attendant headaches (those lists usually live on stack frame
of some function, so just having the lock next to the list_head would
do us no good, etc.).  Much easier to have the shrink_dentry_list()
do all the manipulations...

The bottom line is, once it's on a shrink list, it'll stay there
until shrink_dentry_list().  It may get extra references after
being inserted there (e.g. be found by hash lookup), it may drop
those, whatever - it won't get freed until we run shrink_dentry_list().
If it ends up with extra references, no problem - shrink_dentry_list()
will just kick it off the shrink list and leave it alone.

Note, BTW, that umount coming between isolate and drop is not a problem;
it call shrink_dcache_parent() on the root.  And if shrink_dcache_parent()
finds something on (another) shrink list, it won't put it to the shrink
list of its own, but it will make note of that and repeat the scan in
such case.  So if we find something with zero refcount and not on
shrink list, we can move it to our shrink list and be sure that its
superblock won't go away under us...
Tobin C. Harding April 11, 2019, 5:05 a.m. UTC | #4
On Thu, Apr 11, 2019 at 05:47:46AM +0100, Al Viro wrote:
> On Thu, Apr 11, 2019 at 12:48:21PM +1000, Tobin C. Harding wrote:
> 
> > Oh, so putting entries on a shrink list is enough to pin them?
> 
> Not exactly pin, but __dentry_kill() has this:
>         if (dentry->d_flags & DCACHE_SHRINK_LIST) {
>                 dentry->d_flags |= DCACHE_MAY_FREE;
>                 can_free = false;
>         }
>         spin_unlock(&dentry->d_lock);
>         if (likely(can_free))
>                 dentry_free(dentry);
> and shrink_dentry_list() - this:
>                         if (dentry->d_lockref.count < 0)
>                                 can_free = dentry->d_flags & DCACHE_MAY_FREE;
>                         spin_unlock(&dentry->d_lock);
>                         if (can_free)
>                                 dentry_free(dentry);
> 			continue;
> so if dentry destruction comes before we get around to
> shrink_dentry_list(), it'll stop short of dentry_free() and mark it for
> shrink_dentry_list() to do just dentry_free(); if it overlaps with
> shrink_dentry_list(), but doesn't progress all the way to freeing,
> we will
> 	* have dentry removed from shrink list
> 	* notice the negative ->d_count (i.e. that it has already reached
> __dentry_kill())
> 	* see that __dentry_kill() is not through with tearing the sucker
> apart (no DCACHE_MAY_FREE set)
> ... and just leave it alone, letting __dentry_kill() do the rest of its
> thing - it's already off the shrink list, so __dentry_kill() will do
> everything, including dentry_free().
> 
> The reason for that dance is the locking - shrink list belongs to whoever
> has set it up and nobody else is modifying it.  So __dentry_kill() doesn't
> even try to remove the victim from there; it does all the teardown
> (detaches from inode, unhashes, etc.) and leaves removal from the shrink
> list and actual freeing to the owner of shrink list.  That way we don't
> have to protect all shrink lists a single lock (contention on it would
> be painful) and we don't have to play with per-shrink-list locks and
> all the attendant headaches (those lists usually live on stack frame
> of some function, so just having the lock next to the list_head would
> do us no good, etc.).  Much easier to have the shrink_dentry_list()
> do all the manipulations...
> 
> The bottom line is, once it's on a shrink list, it'll stay there
> until shrink_dentry_list().  It may get extra references after
> being inserted there (e.g. be found by hash lookup), it may drop
> those, whatever - it won't get freed until we run shrink_dentry_list().
> If it ends up with extra references, no problem - shrink_dentry_list()
> will just kick it off the shrink list and leave it alone.
> 
> Note, BTW, that umount coming between isolate and drop is not a problem;
> it call shrink_dcache_parent() on the root.  And if shrink_dcache_parent()
> finds something on (another) shrink list, it won't put it to the shrink
> list of its own, but it will make note of that and repeat the scan in
> such case.  So if we find something with zero refcount and not on
> shrink list, we can move it to our shrink list and be sure that its
> superblock won't go away under us...

Man, that was good to read.  Thanks for taking the time to write this.


	Tobin
Al Viro April 11, 2019, 8:01 p.m. UTC | #5
On Thu, Apr 11, 2019 at 05:47:46AM +0100, Al Viro wrote:

> The reason for that dance is the locking - shrink list belongs to whoever
> has set it up and nobody else is modifying it.  So __dentry_kill() doesn't
> even try to remove the victim from there; it does all the teardown
> (detaches from inode, unhashes, etc.) and leaves removal from the shrink
> list and actual freeing to the owner of shrink list.  That way we don't
> have to protect all shrink lists a single lock (contention on it would
> be painful) and we don't have to play with per-shrink-list locks and
> all the attendant headaches (those lists usually live on stack frame
> of some function, so just having the lock next to the list_head would
> do us no good, etc.).  Much easier to have the shrink_dentry_list()
> do all the manipulations...
> 
> The bottom line is, once it's on a shrink list, it'll stay there
> until shrink_dentry_list().  It may get extra references after
> being inserted there (e.g. be found by hash lookup), it may drop
> those, whatever - it won't get freed until we run shrink_dentry_list().
> If it ends up with extra references, no problem - shrink_dentry_list()
> will just kick it off the shrink list and leave it alone.

FWIW, here's a braindump of sorts on the late stages of dentry
lifecycle (cut'n'paste from the local notes, with minimal editing;
I think the outright obscenities are all gone, but not much is done
beyond that):

        Events at the end of life

__dentry_kill() is called.  This is the point of no return; the victim
has no counting references left, no new ones are coming and we are
committed to tearing it down.  Caller is holding the following locks:
	a) ->d_lock on dentry itself
	b) ->i_lock on its inode, if dentry is positive
	c) ->d_lock on its parent, if dentry has a parent.
Acquiring those in the sane order (a nests inside of c, which nests inside of b)
can be rather convoluted, but that's the responsibility of callers.

State of dentry at that point:
        * it must not be a PAR_LOOKUP one, if it ever had been.  [See section
on PAR_LOOKUP state, specifically the need to exit that state before
dropping the last reference; <<the section in question is in too disorganised
state to include it here>>].
	* ->d_count is either 0 (eviction pathways - d_prune_aliases(),
shrink_dentry_list()) or 1 (when we are disposing of the last reference
and want it evicted rather than retained - dentry_kill(), called by
dput() or shrink_dentry_list()).  Note that ->d_lock stabilizes ->d_count.
        * its ->d_subdirs must be already empty (or we would've had
counting references from those).  Again, stabilized by ->d_lock.

We can detect dentries having reached that state by observing (under ->d_lock)
a negative ->d_count - that's the very first thing __dentry_kill() does.

At that point ->d_prune() is called - that's the last chance for a filesystem
to see a doomed dentry more or less intact.

After that dentry passes through several stages of teardown:
        * if dentry had been on LRU list, it is removed from there.
        * if dentry had been hashed, it is unhashed (and ->d_seq is
bumped)
        * dentry is made unreachable via d_child
        * dentry is made negative; if it used to be positive, inode
reference is dropped.  That's another place where filesystem might
get a chance to play (->d_iput(), as always for transitions from
positive to negative).  At that stage all spinlocks are dropped.
	* final filesystem call: ->d_release().  That's the time
to release whatever data structures filesystem might've had augmenting
that dentry.  NOTE: lockless accesses are still possible at that
point, so anything needed for those (->d_hash(), ->d_compare(),
lockless case of ->d_revalidate(), lockless case of ->d_manage())
MUST NOT be freed without an RCU delay.

At that stage dentry is essentially a dead body.  It might still
have lockless references hanging around and it might on someone's
shrink list, but that's it.  The next stage is body disposal,
either immediately (if not on anyone's shrink list) or once
the owner of shrink list in question gets around to
shrink_dentry_list().

Disposal is done in dentry_free().  For dentries not on any
shrink list it's called directly from __dentry_kill().  That's
the normal case.  For dentries currently on some shrink list
__dentry_kill() marks the dentry as fully dead (DCACHE_MAY_FREE)
and leave it for eventual shrink_dentry_list() to feed to
dentry_free().

Once dentry_free() is called, there can be only lockless references.
At that point the only things left in the sucker are
	* name (->d_name)
	* superblock it belongs to (->d_sb; won't be freed without
an RCU delay and neither will its file_system_type)
	* methods' table (->d_op)
	* ->d_flags and ->d_seq
	* parent's address (->d_parent; not pinned anymore - its
ownership is passed to caller, which proceeds to drop the reference.
However, parent will also not be freed without an RCU delay,
so lockless users can safely dereference it)
	* ->d_fsdata, if the filesystem had seen fit to leave it
around (see above re RCU delays for destroying anything used
by lockless methods)

Generally we don't get around to actually freeing dentry
(in __d_free()/__d_free_external()) without an RCU delay.

There is one important case where we *do* expedited freeing -
pipes and sockets (to be more precise, the stuff created by
alloc_file_pseudo()).  Those can't have lockless references
at all - they are never hashed, they are not anyone's parents
and they can't be a starting point of a lockless pathwalk
(see path_init() for details).  And they are created and
destroyed often enough to make RCU delays a noticable burden.
So for those we do freeing immediately.  In -next it's
marked by DCACHE_NORCU in flags; in mainline it's a bit of
a mess at the moment.

The reason for __d_free/__d_free_external separation is
somewhat subtle.  We obviously need an RCU delay between
dentry_free() and freeing an external name, but why not
do the "drop refcout on external name and free if it hits
zero" right in __d_free()?  The thing is, we need an RCU
delay between the last decrement of extname refcount and
its freeing.  Suppose we have two dentries that happen
to share an extname.  Initially:

d1->d_name.name == d2->d_name.name == &ext->name; ext->count == 2

CPU1:
dentry_free(d1)
call_rcu() schedules __d_free()

CPU2:
d_path() on child of d2: rcu_read_lock(),
start walking towards root, copying names
get to d2, pick d2->d_name.name (i.e. ext->name)

CPU3:
rename d2, dropping a reference to its old name.
ext->count is 1 now, nothing freed.

CPU2:
start copying ext->name[]

... and scheduled __d_free() runs, dropping the last reference to
ext and freeing it.  The reason is that call_rcu() has happened
*BEFORE* rcu_read_lock(), so we get no protection whatsoever.

In other words, we need the decrement and check of external name
refcount before the RCU delay.  We could do the decrement and
check in __d_free(), but that would demand an additional RCU
delay for freeing.  It's cheaper do decrement-and-check right
in dentry_free() and make the decision whether to free there.
Thus the two variants of __d_free() - one for "need to free
the external name", another for "no external name or not the
last reference to it".

In the scenario above the actual kernel gets ext->count to 1
in the dentry_free(d1) and schedules plain __d_free().  Then
when we rename d2 dropping the other reference gets ext->count
to 0 and we use kfree_rcu() to schedule its freeing.  And _that_
happens after ->d_name switch, so either d_path() doesn't see
ext at all, or we are guaranteed that RCU delay before freeing
ext has started after rcu_read_lock() has been done by d_path().
Al Viro April 11, 2019, 9:02 p.m. UTC | #6
On Thu, Apr 11, 2019 at 05:47:46AM +0100, Al Viro wrote:

> Note, BTW, that umount coming between isolate and drop is not a problem;
> it call shrink_dcache_parent() on the root.  And if shrink_dcache_parent()
> finds something on (another) shrink list, it won't put it to the shrink
> list of its own, but it will make note of that and repeat the scan in
> such case.  So if we find something with zero refcount and not on
> shrink list, we can move it to our shrink list and be sure that its
> superblock won't go away under us...

Aaaarrgghhh...  No, we can't.  Look: we get one candidate dentry in isolate
phase.  We put it into shrink list.  umount(2) comes and calls
shrink_dcache_for_umount(), which calls shrink_dcache_parent(root).
In the meanwhile, shrink_dentry_list() is run and does __dentry_kill() on
that one dentry.  Fine, it's gone - before shrink_dcache_parent() even
sees it.  Now shrink_dentry_list() holds a reference to its parent and
is about to drop it in
                dentry = parent;
                while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
                        dentry = dentry_kill(dentry);
And dropped it will be, but... shrink_dcache_parent() has finished the
scan, without finding *anything* with zero refcount - the thing that used
to be on the shrink list was already gone before shrink_dcache_parent()
has gotten there and the reference to parent was not dropped yet.  So
shrink_dcache_for_umount() plows past shrink_dcache_parent(), walks the
tree and complains loudly about "busy" dentries (that parent we hadn't
finished dropping), and then we proceed with filesystem shutdown.
In the meanwhile, dentry_kill() finally gets to killing dentry and
triggers an unexpected late call of ->d_iput() on a filesystem that
has already been far enough into shutdown - far enough to destroy the
data structures needed for that sucker.

The reason we don't hit that problem with regular memory shrinker is
this:
                unregister_shrinker(&s->s_shrink);
                fs->kill_sb(s);
in deactivate_locked_super().  IOW, shrinker for this fs is gone
before we get around to shutdown.  And so are all normal sources
of dentry eviction for that fs.

Your earlier variants all suffer the same problem - picking a page
shared by dentries from several superblocks can run into trouble
if it overlaps with umount of one of those.

Fuck...  One variant of solution would be to have per-superblock
struct kmem_cache to be used for dentries of that superblock.
However,
	* we'd need to prevent them getting merged
	* it would add per-superblock memory costs (for struct
kmem_cache and associated structures)
	* it might mean more pages eaten by the dentries -
on average half a page per superblock (more if there are very
few dentries on that superblock)

OTOH, it might actually improve the memory footprint - all
dentries sharing a page would be from the same superblock,
so the use patterns might be more similar, which might
lower the fragmentation...

Hell knows...  I'd like to hear an opinion from VM folks on
that one.  Comments?
Al Viro June 29, 2019, 4:08 a.m. UTC | #7
On Thu, Apr 11, 2019 at 10:02:00PM +0100, Al Viro wrote:

> Aaaarrgghhh...  No, we can't.  Look: we get one candidate dentry in isolate
> phase.  We put it into shrink list.  umount(2) comes and calls
> shrink_dcache_for_umount(), which calls shrink_dcache_parent(root).
> In the meanwhile, shrink_dentry_list() is run and does __dentry_kill() on
> that one dentry.  Fine, it's gone - before shrink_dcache_parent() even
> sees it.  Now shrink_dentry_list() holds a reference to its parent and
> is about to drop it in
>                 dentry = parent;
>                 while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
>                         dentry = dentry_kill(dentry);
> And dropped it will be, but... shrink_dcache_parent() has finished the
> scan, without finding *anything* with zero refcount - the thing that used
> to be on the shrink list was already gone before shrink_dcache_parent()
> has gotten there and the reference to parent was not dropped yet.  So
> shrink_dcache_for_umount() plows past shrink_dcache_parent(), walks the
> tree and complains loudly about "busy" dentries (that parent we hadn't
> finished dropping), and then we proceed with filesystem shutdown.
> In the meanwhile, dentry_kill() finally gets to killing dentry and
> triggers an unexpected late call of ->d_iput() on a filesystem that
> has already been far enough into shutdown - far enough to destroy the
> data structures needed for that sucker.
> 
> The reason we don't hit that problem with regular memory shrinker is
> this:
>                 unregister_shrinker(&s->s_shrink);
>                 fs->kill_sb(s);
> in deactivate_locked_super().  IOW, shrinker for this fs is gone
> before we get around to shutdown.  And so are all normal sources
> of dentry eviction for that fs.
> 
> Your earlier variants all suffer the same problem - picking a page
> shared by dentries from several superblocks can run into trouble
> if it overlaps with umount of one of those.

FWIW, I think I see a kinda-sorta sane solution.  Namely, add

static void __dput_to_list(struct dentry *dentry, struct list_head *list)
{
	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
		/* let the owner of the list it's on deal with it */
		--dentry->d_lockref.count;
	} else {
		if (dentry->d_flags & DCACHE_LRU_LIST)
			d_lru_del(dentry);
		if (!--dentry->d_lockref.count)
			d_shrink_add(parent, list);
	}
}

and have
shrink_dentry_list() do this in the end of loop:
                d_shrink_del(dentry);
                parent = dentry->d_parent;
		/* both dentry and parent are locked at that point */
		if (parent != dentry) {
			/*
			 * We need to prune ancestors too. This is necessary to
			 * prevent quadratic behavior of shrink_dcache_parent(),
			 * but is also expected to be beneficial in reducing
			 * dentry cache fragmentation.
			 */
			__dput_to_list(parent, list);
		}
		__dentry_kill(dentry);
        }

instead of
                d_shrink_del(dentry);
                parent = dentry->d_parent;
                __dentry_kill(dentry);
                if (parent == dentry)
                        continue;
                /*
                 * We need to prune ancestors too. This is necessary to prevent
                 * quadratic behavior of shrink_dcache_parent(), but is also
                 * expected to be beneficial in reducing dentry cache
                 * fragmentation.
                 */
                dentry = parent;
                while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
                        dentry = dentry_kill(dentry);
        }
we have there now.  Linus, do you see any problems with that change?  AFAICS,
that should avoid the problem described above.  Moreover, it seems to allow
a fun API addition:

void dput_to_list(struct dentry *dentry, struct list_head *list)
{
	rcu_read_lock();
	if (likely(fast_dput(dentry))) {
		rcu_read_unlock();
		return;
	}
	rcu_read_unlock();
	if (!retain_dentry(dentry))
		__dput_to_list(dentry, list);
	spin_unlock(&dentry->d_lock);
}

allowing to take an empty list, do a bunch of dput_to_list() (under spinlocks,
etc.), then, once we are in better locking conditions, shrink_dentry_list()
to take them all out.  I can see applications for that in e.g. fs/namespace.c -
quite a bit of kludges with ->mnt_ex_mountpoint would be killable that way,
and there would be a chance to transfer the contribution to ->d_count of
mountpoint from struct mount to struct mountpoint (i.e. make any number of
mounts on the same mountpoint dentry contribute only 1 to its ->d_count,
not the number of such mounts).

Patch
diff mbox series

diff --git a/fs/dcache.c b/fs/dcache.c
index 606cfca20d42..5c707ed9ab5a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -30,6 +30,7 @@ 
 #include <linux/bit_spinlock.h>
 #include <linux/rculist_bl.h>
 #include <linux/list_lru.h>
+#include <linux/backing-dev.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -3068,6 +3069,74 @@  void d_tmpfile(struct dentry *dentry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+/*
+ * d_isolate() - Dentry isolation callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to isolate.
+ * @nr: Number of objects in @v.
+ *
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *d_isolate(struct kmem_cache *s, void **v, int nr)
+{
+	struct dentry *dentry;
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		__dget(dentry);
+	}
+
+	return NULL;		/* No need for private data */
+}
+
+/*
+ * d_partial_shrink() - Dentry migration callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to migrate.
+ * @nr: Number of objects in @v.
+ * @node: The NUMA node where new object should be allocated.
+ * @private: Returned by d_isolate() (currently %NULL).
+ *
+ * Dentry objects _can not_ be relocated and shrinking the whole dcache
+ * can be expensive.  This is an effort to free dentry objects that are
+ * stopping slab pages from being free'd without clearing the whole dcache.
+ *
+ * This callback is called from the SLUB allocator object migration
+ * infrastructure in attempt to free up slab pages by freeing dentry
+ * objects from partially full slabs.
+ */
+static void d_partial_shrink(struct kmem_cache *s, void **v, int nr,
+		      int node, void *_unused)
+{
+	struct dentry *dentry;
+	LIST_HEAD(dispose);
+	int i;
+
+	for (i = 0; i < nr; i++) {
+		dentry = v[i];
+		spin_lock(&dentry->d_lock);
+		dentry->d_lockref.count--;
+
+		if (dentry->d_lockref.count > 0 ||
+		    dentry->d_flags & DCACHE_SHRINK_LIST) {
+			spin_unlock(&dentry->d_lock);
+			continue;
+		}
+
+		if (dentry->d_flags & DCACHE_LRU_LIST)
+			d_lru_del(dentry);
+
+		d_shrink_add(dentry, &dispose);
+
+		spin_unlock(&dentry->d_lock);
+	}
+
+	if (!list_empty(&dispose))
+		shrink_dentry_list(&dispose);
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
@@ -3113,6 +3182,8 @@  static void __init dcache_init(void)
 					   sizeof_field(struct dentry, d_iname),
 					   dcache_ctor);
 
+	kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
 		return;