diff mbox series

[RFC,09/15] slub: Enable slab defragmentation using SMO

Message ID 20190308041426.16654-10-tobin@kernel.org (mailing list archive)
State New, archived
Headers show
Series mm: Implement Slab Movable Objects (SMO) | expand

Commit Message

Tobin C. Harding March 8, 2019, 4:14 a.m. UTC
If many objects are allocated with the slab allocator and freed in an
arbitrary order then the slab caches can become internally fragmented.
Now that the slab allocator supports movable objects we can defragment
any cache that has this feature enabled.

Slab defragmentation may occur:

1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
   by the kernel calling kmem_cache_shrink().

2. Unconditionally through the use of the slabinfo command.

	slabinfo <cache> -s

3. Conditionally via the use of kmem_cache_defrag()

Use SMO when shrinking cache.  Currently when the kernel calls
kmem_cache_shrink() we curate the partial slabs list.  If object
migration is not enabled for the cache we still do this, if however SMO
is enabled, we attempt to move objects in partially full slabs in order
to defragment the cache.  Shrink attempts to move all objects in order
to reduce the cache to a single partial slab for each node.

kmem_cache_defrag() differs from shrink in that it operates dependent on
the defrag_used_ratio and only attempts to move objects if the number of
partial slabs exceeds MAX_PARTIAL (for each node).

Add function kmem_cache_defrag(int node).

   kmem_cache_defrag() only performs defragmentation if the usage ratio
   of the slab is lower than the configured percentage (sysfs file added
   in previous patch).  Fragmentation ratios are measured by calculating
   the percentage of objects in use compared to the total number of
   objects that the slab page can accommodate.

   The scanning of slab caches is optimized because the defragmentable
   slabs come first on the list. Thus we can terminate scans on the
   first slab encountered that does not support defragmentation.

   kmem_cache_defrag() takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.

   Defragmentation may be disabled by setting defrag ratio to 0

	echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio

In order for a cache to be defragmentable the cache must support object
migration (SMO).  Enabling SMO for a cache is done via a call to the
recently added function:

	void kmem_cache_setup_mobility(struct kmem_cache *,
				       kmem_cache_isolate_func,
			               kmem_cache_migrate_func);

Co-developed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
---
 include/linux/slab.h |   1 +
 mm/slub.c            | 266 +++++++++++++++++++++++++++++++------------
 2 files changed, 194 insertions(+), 73 deletions(-)

Comments

Roman Gushchin March 11, 2019, 11:35 p.m. UTC | #1
On Fri, Mar 08, 2019 at 03:14:20PM +1100, Tobin C. Harding wrote:
> If many objects are allocated with the slab allocator and freed in an
> arbitrary order then the slab caches can become internally fragmented.
> Now that the slab allocator supports movable objects we can defragment
> any cache that has this feature enabled.
> 
> Slab defragmentation may occur:
> 
> 1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
>    by the kernel calling kmem_cache_shrink().
> 
> 2. Unconditionally through the use of the slabinfo command.
> 
> 	slabinfo <cache> -s
> 
> 3. Conditionally via the use of kmem_cache_defrag()
> 
> Use SMO when shrinking cache.  Currently when the kernel calls
> kmem_cache_shrink() we curate the partial slabs list.  If object
> migration is not enabled for the cache we still do this, if however SMO
> is enabled, we attempt to move objects in partially full slabs in order
> to defragment the cache.  Shrink attempts to move all objects in order
> to reduce the cache to a single partial slab for each node.
> 
> kmem_cache_defrag() differs from shrink in that it operates dependent on
> the defrag_used_ratio and only attempts to move objects if the number of
> partial slabs exceeds MAX_PARTIAL (for each node).
> 
> Add function kmem_cache_defrag(int node).
> 
>    kmem_cache_defrag() only performs defragmentation if the usage ratio
>    of the slab is lower than the configured percentage (sysfs file added
>    in previous patch).  Fragmentation ratios are measured by calculating
>    the percentage of objects in use compared to the total number of
>    objects that the slab page can accommodate.
> 
>    The scanning of slab caches is optimized because the defragmentable
>    slabs come first on the list. Thus we can terminate scans on the
>    first slab encountered that does not support defragmentation.
> 
>    kmem_cache_defrag() takes a node parameter. This can either be -1 if
>    defragmentation should be performed on all nodes, or a node number.
> 
>    Defragmentation may be disabled by setting defrag ratio to 0
> 
> 	echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio
> 
> In order for a cache to be defragmentable the cache must support object
> migration (SMO).  Enabling SMO for a cache is done via a call to the
> recently added function:
> 
> 	void kmem_cache_setup_mobility(struct kmem_cache *,
> 				       kmem_cache_isolate_func,
> 			               kmem_cache_migrate_func);
> 
> Co-developed-by: Christoph Lameter <cl@linux.com>
> Signed-off-by: Tobin C. Harding <tobin@kernel.org>
> ---
>  include/linux/slab.h |   1 +
>  mm/slub.c            | 266 +++++++++++++++++++++++++++++++------------
>  2 files changed, 194 insertions(+), 73 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 22e87c41b8a4..b9b46bc9937e 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -147,6 +147,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
>  			void (*ctor)(void *));
>  void kmem_cache_destroy(struct kmem_cache *);
>  int kmem_cache_shrink(struct kmem_cache *);
> +int kmem_cache_defrag(int node);
>  
>  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
>  void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> diff --git a/mm/slub.c b/mm/slub.c
> index 515db0f36c55..53dd4cb5b5a4 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -354,6 +354,12 @@ static __always_inline void slab_lock(struct page *page)
>  	bit_spin_lock(PG_locked, &page->flags);
>  }
>  
> +static __always_inline int slab_trylock(struct page *page)
> +{
> +	VM_BUG_ON_PAGE(PageTail(page), page);
> +	return bit_spin_trylock(PG_locked, &page->flags);
> +}
> +
>  static __always_inline void slab_unlock(struct page *page)
>  {
>  	VM_BUG_ON_PAGE(PageTail(page), page);
> @@ -3959,79 +3965,6 @@ void kfree(const void *x)
>  }
>  EXPORT_SYMBOL(kfree);
>  
> -#define SHRINK_PROMOTE_MAX 32
> -
> -/*
> - * kmem_cache_shrink discards empty slabs and promotes the slabs filled
> - * up most to the head of the partial lists. New allocations will then
> - * fill those up and thus they can be removed from the partial lists.
> - *
> - * The slabs with the least items are placed last. This results in them
> - * being allocated from last increasing the chance that the last objects
> - * are freed in them.
> - */
> -int __kmem_cache_shrink(struct kmem_cache *s)
> -{
> -	int node;
> -	int i;
> -	struct kmem_cache_node *n;
> -	struct page *page;
> -	struct page *t;
> -	struct list_head discard;
> -	struct list_head promote[SHRINK_PROMOTE_MAX];
> -	unsigned long flags;
> -	int ret = 0;
> -
> -	flush_all(s);
> -	for_each_kmem_cache_node(s, node, n) {
> -		INIT_LIST_HEAD(&discard);
> -		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
> -			INIT_LIST_HEAD(promote + i);
> -
> -		spin_lock_irqsave(&n->list_lock, flags);
> -
> -		/*
> -		 * Build lists of slabs to discard or promote.
> -		 *
> -		 * Note that concurrent frees may occur while we hold the
> -		 * list_lock. page->inuse here is the upper limit.
> -		 */
> -		list_for_each_entry_safe(page, t, &n->partial, lru) {
> -			int free = page->objects - page->inuse;
> -
> -			/* Do not reread page->inuse */
> -			barrier();
> -
> -			/* We do not keep full slabs on the list */
> -			BUG_ON(free <= 0);
> -
> -			if (free == page->objects) {
> -				list_move(&page->lru, &discard);
> -				n->nr_partial--;
> -			} else if (free <= SHRINK_PROMOTE_MAX)
> -				list_move(&page->lru, promote + free - 1);
> -		}
> -
> -		/*
> -		 * Promote the slabs filled up most to the head of the
> -		 * partial list.
> -		 */
> -		for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
> -			list_splice(promote + i, &n->partial);
> -
> -		spin_unlock_irqrestore(&n->list_lock, flags);
> -
> -		/* Release empty slabs */
> -		list_for_each_entry_safe(page, t, &discard, lru)
> -			discard_slab(s, page);
> -
> -		if (slabs_node(s, node))
> -			ret = 1;
> -	}
> -
> -	return ret;
> -}
> -
>  #ifdef CONFIG_MEMCG
>  static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
>  {
> @@ -4411,6 +4344,193 @@ static void __move(struct page *page, void *scratch, int node)
>  	s->migrate(s, vector, count, node, private);
>  }
>  
> +/*
> + * __defrag() - Defragment node.
> + * @s: cache we are working on.
> + * @node: The node to move objects from.
> + * @target_node: The node to move objects to.
> + * @ratio: The defrag ratio (percentage, between 0 and 100).
> + *
> + * Release slabs with zero objects and try to call the migration function
> + * for slabs with less than the 'ratio' percentage of objects allocated.
> + *
> + * Moved objects are allocated on @target_node.
> + *
> + * Return: The number of partial slabs left on the node after the operation.
> + */
> +static unsigned long __defrag(struct kmem_cache *s, int node, int target_node,
> +			      int ratio)

Maybe kmem_cache_defrag_node()?

> +{
> +	struct kmem_cache_node *n = get_node(s, node);
> +	struct page *page, *page2;
> +	LIST_HEAD(move_list);
> +	unsigned long flags;
> +
> +	if (node == target_node && n->nr_partial <= 1) {
> +		/*
> +		 * Trying to reduce fragmentation on a node but there is
> +		 * only a single or no partial slab page. This is already
> +		 * the optimal object density that we can reach.
> +		 */
> +		return n->nr_partial;
> +	}
> +
> +	spin_lock_irqsave(&n->list_lock, flags);
> +	list_for_each_entry_safe(page, page2, &n->partial, lru) {
> +		if (!slab_trylock(page))
> +			/* Busy slab. Get out of the way */
> +			continue;
> +
> +		if (page->inuse) {
> +			if (page->inuse > ratio * page->objects / 100) {
> +				slab_unlock(page);
> +				/*
> +				 * Skip slab because the object density
> +				 * in the slab page is high enough.
> +				 */
> +				continue;
> +			}
> +
> +			list_move(&page->lru, &move_list);
> +			if (s->migrate) {
> +				/* Stop page being considered for allocations */
> +				n->nr_partial--;
> +				page->frozen = 1;
> +			}
> +			slab_unlock(page);
> +		} else {	/* Empty slab page */
> +			list_del(&page->lru);
> +			n->nr_partial--;
> +			slab_unlock(page);
> +			discard_slab(s, page);
> +		}
> +	}
> +
> +	if (!s->migrate) {
> +		/*
> +		 * No defrag method. By simply putting the zaplist at the
> +		 * end of the partial list we can let them simmer longer
> +		 * and thus increase the chance of all objects being
> +		 * reclaimed.
> +		 *
> +		 */
> +		list_splice(&move_list, n->partial.prev);
> +	}
> +
> +	spin_unlock_irqrestore(&n->list_lock, flags);
> +
> +	if (s->migrate && !list_empty(&move_list)) {
> +		void **scratch = alloc_scratch(s);
> +		struct page *page, *page2;
> +
> +		if (scratch) {
> +			/* Try to remove / move the objects left */
> +			list_for_each_entry(page, &move_list, lru) {
> +				if (page->inuse)
> +					__move(page, scratch, target_node);
> +			}
> +			kfree(scratch);
> +		}
> +
> +		/* Inspect results and dispose of pages */
> +		spin_lock_irqsave(&n->list_lock, flags);
> +		list_for_each_entry_safe(page, page2, &move_list, lru) {
> +			list_del(&page->lru);
> +			slab_lock(page);
> +			page->frozen = 0;
> +
> +			if (page->inuse) {
> +				/*
> +				 * Objects left in slab page, move it to the
> +				 * tail of the partial list to increase the
> +				 * chance that the freeing of the remaining
> +				 * objects will free the slab page.
> +				 */
> +				n->nr_partial++;
> +				list_add_tail(&page->lru, &n->partial);
> +				slab_unlock(page);
> +			} else {
> +				slab_unlock(page);
> +				discard_slab(s, page);
> +			}
> +		}
> +		spin_unlock_irqrestore(&n->list_lock, flags);
> +	}
> +
> +	return n->nr_partial;
> +}
> +
> +/**
> + * kmem_cache_defrag() - Defrag slab caches.
> + * @node: The node to defrag or -1 for all nodes.
> + *
> + * Defrag slabs conditional on the amount of fragmentation in a page.
> + */
> +int kmem_cache_defrag(int node)
> +{
> +	struct kmem_cache *s;
> +	unsigned long left = 0;
> +
> +	/*
> +	 * kmem_cache_defrag may be called from the reclaim path which may be
> +	 * called for any page allocator alloc. So there is the danger that we
> +	 * get called in a situation where slub already acquired the slub_lock
> +	 * for other purposes.
> +	 */
> +	if (!mutex_trylock(&slab_mutex))
> +		return 0;
> +
> +	list_for_each_entry(s, &slab_caches, list) {
> +		/*
> +		 * Defragmentable caches come first. If the slab cache is not
> +		 * defragmentable then we can stop traversing the list.
> +		 */
> +		if (!s->migrate)
> +			break;
> +
> +		if (node == -1) {
> +			int nid;
> +
> +			for_each_node_state(nid, N_NORMAL_MEMORY)
> +				if (s->node[nid]->nr_partial > MAX_PARTIAL)
> +					left += __defrag(s, nid, nid, s->defrag_used_ratio);
> +		} else {
> +			if (s->node[node]->nr_partial > MAX_PARTIAL)
> +				left += __defrag(s, node, node, s->defrag_used_ratio);
> +		}
> +	}
> +	mutex_unlock(&slab_mutex);
> +	return left;
> +}
> +EXPORT_SYMBOL(kmem_cache_defrag);
> +
> +/**
> + * __kmem_cache_shrink() - Shrink a cache.
> + * @s: The cache to shrink.
> + *
> + * Reduces the memory footprint of a slab cache by as much as possible.
> + *
> + * This works by:
> + *  1. Removing empty slabs from the partial list.
> + *  2. Migrating slab objects to denser slab pages if the slab cache
> + *  supports migration.  If not, reorganizing the partial list so that
> + *  more densely allocated slab pages come first.
> + *
> + * Not called directly, called by kmem_cache_shrink().
> + */
> +int __kmem_cache_shrink(struct kmem_cache *s)
> +{
> +	int node;
> +	int left = 0;

s/int/unsigned long? Or s/unsigned long/int in __defrag()?

Thanks!
Tobin Harding March 12, 2019, 1:49 a.m. UTC | #2
On Mon, Mar 11, 2019 at 11:35:29PM +0000, Roman Gushchin wrote:
> On Fri, Mar 08, 2019 at 03:14:20PM +1100, Tobin C. Harding wrote:
> > If many objects are allocated with the slab allocator and freed in an
> > arbitrary order then the slab caches can become internally fragmented.
> > Now that the slab allocator supports movable objects we can defragment
> > any cache that has this feature enabled.
> > 
> > Slab defragmentation may occur:
> > 
> > 1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
> >    by the kernel calling kmem_cache_shrink().
> > 
> > 2. Unconditionally through the use of the slabinfo command.
> > 
> > 	slabinfo <cache> -s
> > 
> > 3. Conditionally via the use of kmem_cache_defrag()
> > 
> > Use SMO when shrinking cache.  Currently when the kernel calls
> > kmem_cache_shrink() we curate the partial slabs list.  If object
> > migration is not enabled for the cache we still do this, if however SMO
> > is enabled, we attempt to move objects in partially full slabs in order
> > to defragment the cache.  Shrink attempts to move all objects in order
> > to reduce the cache to a single partial slab for each node.
> > 
> > kmem_cache_defrag() differs from shrink in that it operates dependent on
> > the defrag_used_ratio and only attempts to move objects if the number of
> > partial slabs exceeds MAX_PARTIAL (for each node).
> > 
> > Add function kmem_cache_defrag(int node).
> > 
> >    kmem_cache_defrag() only performs defragmentation if the usage ratio
> >    of the slab is lower than the configured percentage (sysfs file added
> >    in previous patch).  Fragmentation ratios are measured by calculating
> >    the percentage of objects in use compared to the total number of
> >    objects that the slab page can accommodate.
> > 
> >    The scanning of slab caches is optimized because the defragmentable
> >    slabs come first on the list. Thus we can terminate scans on the
> >    first slab encountered that does not support defragmentation.
> > 
> >    kmem_cache_defrag() takes a node parameter. This can either be -1 if
> >    defragmentation should be performed on all nodes, or a node number.
> > 
> >    Defragmentation may be disabled by setting defrag ratio to 0
> > 
> > 	echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio
> > 
> > In order for a cache to be defragmentable the cache must support object
> > migration (SMO).  Enabling SMO for a cache is done via a call to the
> > recently added function:
> > 
> > 	void kmem_cache_setup_mobility(struct kmem_cache *,
> > 				       kmem_cache_isolate_func,
> > 			               kmem_cache_migrate_func);
> > 
> > Co-developed-by: Christoph Lameter <cl@linux.com>
> > Signed-off-by: Tobin C. Harding <tobin@kernel.org>
> > ---
> >  include/linux/slab.h |   1 +
> >  mm/slub.c            | 266 +++++++++++++++++++++++++++++++------------
> >  2 files changed, 194 insertions(+), 73 deletions(-)
> > 
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 22e87c41b8a4..b9b46bc9937e 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -147,6 +147,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
> >  			void (*ctor)(void *));
> >  void kmem_cache_destroy(struct kmem_cache *);
> >  int kmem_cache_shrink(struct kmem_cache *);
> > +int kmem_cache_defrag(int node);
> >  
> >  void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
> >  void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 515db0f36c55..53dd4cb5b5a4 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -354,6 +354,12 @@ static __always_inline void slab_lock(struct page *page)
> >  	bit_spin_lock(PG_locked, &page->flags);
> >  }
> >  
> > +static __always_inline int slab_trylock(struct page *page)
> > +{
> > +	VM_BUG_ON_PAGE(PageTail(page), page);
> > +	return bit_spin_trylock(PG_locked, &page->flags);
> > +}
> > +
> >  static __always_inline void slab_unlock(struct page *page)
> >  {
> >  	VM_BUG_ON_PAGE(PageTail(page), page);
> > @@ -3959,79 +3965,6 @@ void kfree(const void *x)
> >  }
> >  EXPORT_SYMBOL(kfree);
> >  
> > -#define SHRINK_PROMOTE_MAX 32
> > -
> > -/*
> > - * kmem_cache_shrink discards empty slabs and promotes the slabs filled
> > - * up most to the head of the partial lists. New allocations will then
> > - * fill those up and thus they can be removed from the partial lists.
> > - *
> > - * The slabs with the least items are placed last. This results in them
> > - * being allocated from last increasing the chance that the last objects
> > - * are freed in them.
> > - */
> > -int __kmem_cache_shrink(struct kmem_cache *s)
> > -{
> > -	int node;
> > -	int i;
> > -	struct kmem_cache_node *n;
> > -	struct page *page;
> > -	struct page *t;
> > -	struct list_head discard;
> > -	struct list_head promote[SHRINK_PROMOTE_MAX];
> > -	unsigned long flags;
> > -	int ret = 0;
> > -
> > -	flush_all(s);
> > -	for_each_kmem_cache_node(s, node, n) {
> > -		INIT_LIST_HEAD(&discard);
> > -		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
> > -			INIT_LIST_HEAD(promote + i);
> > -
> > -		spin_lock_irqsave(&n->list_lock, flags);
> > -
> > -		/*
> > -		 * Build lists of slabs to discard or promote.
> > -		 *
> > -		 * Note that concurrent frees may occur while we hold the
> > -		 * list_lock. page->inuse here is the upper limit.
> > -		 */
> > -		list_for_each_entry_safe(page, t, &n->partial, lru) {
> > -			int free = page->objects - page->inuse;
> > -
> > -			/* Do not reread page->inuse */
> > -			barrier();
> > -
> > -			/* We do not keep full slabs on the list */
> > -			BUG_ON(free <= 0);
> > -
> > -			if (free == page->objects) {
> > -				list_move(&page->lru, &discard);
> > -				n->nr_partial--;
> > -			} else if (free <= SHRINK_PROMOTE_MAX)
> > -				list_move(&page->lru, promote + free - 1);
> > -		}
> > -
> > -		/*
> > -		 * Promote the slabs filled up most to the head of the
> > -		 * partial list.
> > -		 */
> > -		for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
> > -			list_splice(promote + i, &n->partial);
> > -
> > -		spin_unlock_irqrestore(&n->list_lock, flags);
> > -
> > -		/* Release empty slabs */
> > -		list_for_each_entry_safe(page, t, &discard, lru)
> > -			discard_slab(s, page);
> > -
> > -		if (slabs_node(s, node))
> > -			ret = 1;
> > -	}
> > -
> > -	return ret;
> > -}
> > -
> >  #ifdef CONFIG_MEMCG
> >  static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
> >  {
> > @@ -4411,6 +4344,193 @@ static void __move(struct page *page, void *scratch, int node)
> >  	s->migrate(s, vector, count, node, private);
> >  }
> >  
> > +/*
> > + * __defrag() - Defragment node.
> > + * @s: cache we are working on.
> > + * @node: The node to move objects from.
> > + * @target_node: The node to move objects to.
> > + * @ratio: The defrag ratio (percentage, between 0 and 100).
> > + *
> > + * Release slabs with zero objects and try to call the migration function
> > + * for slabs with less than the 'ratio' percentage of objects allocated.
> > + *
> > + * Moved objects are allocated on @target_node.
> > + *
> > + * Return: The number of partial slabs left on the node after the operation.
> > + */
> > +static unsigned long __defrag(struct kmem_cache *s, int node, int target_node,
> > +			      int ratio)
> 
> Maybe kmem_cache_defrag_node()?
> 
> > +{
> > +	struct kmem_cache_node *n = get_node(s, node);
> > +	struct page *page, *page2;
> > +	LIST_HEAD(move_list);
> > +	unsigned long flags;
> > +
> > +	if (node == target_node && n->nr_partial <= 1) {
> > +		/*
> > +		 * Trying to reduce fragmentation on a node but there is
> > +		 * only a single or no partial slab page. This is already
> > +		 * the optimal object density that we can reach.
> > +		 */
> > +		return n->nr_partial;
> > +	}
> > +
> > +	spin_lock_irqsave(&n->list_lock, flags);
> > +	list_for_each_entry_safe(page, page2, &n->partial, lru) {
> > +		if (!slab_trylock(page))
> > +			/* Busy slab. Get out of the way */
> > +			continue;
> > +
> > +		if (page->inuse) {
> > +			if (page->inuse > ratio * page->objects / 100) {
> > +				slab_unlock(page);
> > +				/*
> > +				 * Skip slab because the object density
> > +				 * in the slab page is high enough.
> > +				 */
> > +				continue;
> > +			}
> > +
> > +			list_move(&page->lru, &move_list);
> > +			if (s->migrate) {
> > +				/* Stop page being considered for allocations */
> > +				n->nr_partial--;
> > +				page->frozen = 1;
> > +			}
> > +			slab_unlock(page);
> > +		} else {	/* Empty slab page */
> > +			list_del(&page->lru);
> > +			n->nr_partial--;
> > +			slab_unlock(page);
> > +			discard_slab(s, page);
> > +		}
> > +	}
> > +
> > +	if (!s->migrate) {
> > +		/*
> > +		 * No defrag method. By simply putting the zaplist at the
> > +		 * end of the partial list we can let them simmer longer
> > +		 * and thus increase the chance of all objects being
> > +		 * reclaimed.
> > +		 *
> > +		 */
> > +		list_splice(&move_list, n->partial.prev);
> > +	}
> > +
> > +	spin_unlock_irqrestore(&n->list_lock, flags);
> > +
> > +	if (s->migrate && !list_empty(&move_list)) {
> > +		void **scratch = alloc_scratch(s);
> > +		struct page *page, *page2;
> > +
> > +		if (scratch) {
> > +			/* Try to remove / move the objects left */
> > +			list_for_each_entry(page, &move_list, lru) {
> > +				if (page->inuse)
> > +					__move(page, scratch, target_node);
> > +			}
> > +			kfree(scratch);
> > +		}
> > +
> > +		/* Inspect results and dispose of pages */
> > +		spin_lock_irqsave(&n->list_lock, flags);
> > +		list_for_each_entry_safe(page, page2, &move_list, lru) {
> > +			list_del(&page->lru);
> > +			slab_lock(page);
> > +			page->frozen = 0;
> > +
> > +			if (page->inuse) {
> > +				/*
> > +				 * Objects left in slab page, move it to the
> > +				 * tail of the partial list to increase the
> > +				 * chance that the freeing of the remaining
> > +				 * objects will free the slab page.
> > +				 */
> > +				n->nr_partial++;
> > +				list_add_tail(&page->lru, &n->partial);
> > +				slab_unlock(page);
> > +			} else {
> > +				slab_unlock(page);
> > +				discard_slab(s, page);
> > +			}
> > +		}
> > +		spin_unlock_irqrestore(&n->list_lock, flags);
> > +	}
> > +
> > +	return n->nr_partial;
> > +}
> > +
> > +/**
> > + * kmem_cache_defrag() - Defrag slab caches.
> > + * @node: The node to defrag or -1 for all nodes.
> > + *
> > + * Defrag slabs conditional on the amount of fragmentation in a page.
> > + */
> > +int kmem_cache_defrag(int node)
> > +{
> > +	struct kmem_cache *s;
> > +	unsigned long left = 0;
> > +
> > +	/*
> > +	 * kmem_cache_defrag may be called from the reclaim path which may be
> > +	 * called for any page allocator alloc. So there is the danger that we
> > +	 * get called in a situation where slub already acquired the slub_lock
> > +	 * for other purposes.
> > +	 */
> > +	if (!mutex_trylock(&slab_mutex))
> > +		return 0;
> > +
> > +	list_for_each_entry(s, &slab_caches, list) {
> > +		/*
> > +		 * Defragmentable caches come first. If the slab cache is not
> > +		 * defragmentable then we can stop traversing the list.
> > +		 */
> > +		if (!s->migrate)
> > +			break;
> > +
> > +		if (node == -1) {
> > +			int nid;
> > +
> > +			for_each_node_state(nid, N_NORMAL_MEMORY)
> > +				if (s->node[nid]->nr_partial > MAX_PARTIAL)
> > +					left += __defrag(s, nid, nid, s->defrag_used_ratio);
> > +		} else {
> > +			if (s->node[node]->nr_partial > MAX_PARTIAL)
> > +				left += __defrag(s, node, node, s->defrag_used_ratio);
> > +		}
> > +	}
> > +	mutex_unlock(&slab_mutex);
> > +	return left;
> > +}
> > +EXPORT_SYMBOL(kmem_cache_defrag);
> > +
> > +/**
> > + * __kmem_cache_shrink() - Shrink a cache.
> > + * @s: The cache to shrink.
> > + *
> > + * Reduces the memory footprint of a slab cache by as much as possible.
> > + *
> > + * This works by:
> > + *  1. Removing empty slabs from the partial list.
> > + *  2. Migrating slab objects to denser slab pages if the slab cache
> > + *  supports migration.  If not, reorganizing the partial list so that
> > + *  more densely allocated slab pages come first.
> > + *
> > + * Not called directly, called by kmem_cache_shrink().
> > + */
> > +int __kmem_cache_shrink(struct kmem_cache *s)
> > +{
> > +	int node;
> > +	int left = 0;
> 
> s/int/unsigned long? Or s/unsigned long/int in __defrag()?

Nice catch, thank you.

     Tobin
diff mbox series

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 22e87c41b8a4..b9b46bc9937e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -147,6 +147,7 @@  struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 			void (*ctor)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
+int kmem_cache_defrag(int node);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
diff --git a/mm/slub.c b/mm/slub.c
index 515db0f36c55..53dd4cb5b5a4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -354,6 +354,12 @@  static __always_inline void slab_lock(struct page *page)
 	bit_spin_lock(PG_locked, &page->flags);
 }
 
+static __always_inline int slab_trylock(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageTail(page), page);
+	return bit_spin_trylock(PG_locked, &page->flags);
+}
+
 static __always_inline void slab_unlock(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageTail(page), page);
@@ -3959,79 +3965,6 @@  void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
-#define SHRINK_PROMOTE_MAX 32
-
-/*
- * kmem_cache_shrink discards empty slabs and promotes the slabs filled
- * up most to the head of the partial lists. New allocations will then
- * fill those up and thus they can be removed from the partial lists.
- *
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
- */
-int __kmem_cache_shrink(struct kmem_cache *s)
-{
-	int node;
-	int i;
-	struct kmem_cache_node *n;
-	struct page *page;
-	struct page *t;
-	struct list_head discard;
-	struct list_head promote[SHRINK_PROMOTE_MAX];
-	unsigned long flags;
-	int ret = 0;
-
-	flush_all(s);
-	for_each_kmem_cache_node(s, node, n) {
-		INIT_LIST_HEAD(&discard);
-		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
-			INIT_LIST_HEAD(promote + i);
-
-		spin_lock_irqsave(&n->list_lock, flags);
-
-		/*
-		 * Build lists of slabs to discard or promote.
-		 *
-		 * Note that concurrent frees may occur while we hold the
-		 * list_lock. page->inuse here is the upper limit.
-		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			int free = page->objects - page->inuse;
-
-			/* Do not reread page->inuse */
-			barrier();
-
-			/* We do not keep full slabs on the list */
-			BUG_ON(free <= 0);
-
-			if (free == page->objects) {
-				list_move(&page->lru, &discard);
-				n->nr_partial--;
-			} else if (free <= SHRINK_PROMOTE_MAX)
-				list_move(&page->lru, promote + free - 1);
-		}
-
-		/*
-		 * Promote the slabs filled up most to the head of the
-		 * partial list.
-		 */
-		for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
-			list_splice(promote + i, &n->partial);
-
-		spin_unlock_irqrestore(&n->list_lock, flags);
-
-		/* Release empty slabs */
-		list_for_each_entry_safe(page, t, &discard, lru)
-			discard_slab(s, page);
-
-		if (slabs_node(s, node))
-			ret = 1;
-	}
-
-	return ret;
-}
-
 #ifdef CONFIG_MEMCG
 static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
 {
@@ -4411,6 +4344,193 @@  static void __move(struct page *page, void *scratch, int node)
 	s->migrate(s, vector, count, node, private);
 }
 
+/*
+ * __defrag() - Defragment node.
+ * @s: cache we are working on.
+ * @node: The node to move objects from.
+ * @target_node: The node to move objects to.
+ * @ratio: The defrag ratio (percentage, between 0 and 100).
+ *
+ * Release slabs with zero objects and try to call the migration function
+ * for slabs with less than the 'ratio' percentage of objects allocated.
+ *
+ * Moved objects are allocated on @target_node.
+ *
+ * Return: The number of partial slabs left on the node after the operation.
+ */
+static unsigned long __defrag(struct kmem_cache *s, int node, int target_node,
+			      int ratio)
+{
+	struct kmem_cache_node *n = get_node(s, node);
+	struct page *page, *page2;
+	LIST_HEAD(move_list);
+	unsigned long flags;
+
+	if (node == target_node && n->nr_partial <= 1) {
+		/*
+		 * Trying to reduce fragmentation on a node but there is
+		 * only a single or no partial slab page. This is already
+		 * the optimal object density that we can reach.
+		 */
+		return n->nr_partial;
+	}
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+		if (!slab_trylock(page))
+			/* Busy slab. Get out of the way */
+			continue;
+
+		if (page->inuse) {
+			if (page->inuse > ratio * page->objects / 100) {
+				slab_unlock(page);
+				/*
+				 * Skip slab because the object density
+				 * in the slab page is high enough.
+				 */
+				continue;
+			}
+
+			list_move(&page->lru, &move_list);
+			if (s->migrate) {
+				/* Stop page being considered for allocations */
+				n->nr_partial--;
+				page->frozen = 1;
+			}
+			slab_unlock(page);
+		} else {	/* Empty slab page */
+			list_del(&page->lru);
+			n->nr_partial--;
+			slab_unlock(page);
+			discard_slab(s, page);
+		}
+	}
+
+	if (!s->migrate) {
+		/*
+		 * No defrag method. By simply putting the zaplist at the
+		 * end of the partial list we can let them simmer longer
+		 * and thus increase the chance of all objects being
+		 * reclaimed.
+		 *
+		 */
+		list_splice(&move_list, n->partial.prev);
+	}
+
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	if (s->migrate && !list_empty(&move_list)) {
+		void **scratch = alloc_scratch(s);
+		struct page *page, *page2;
+
+		if (scratch) {
+			/* Try to remove / move the objects left */
+			list_for_each_entry(page, &move_list, lru) {
+				if (page->inuse)
+					__move(page, scratch, target_node);
+			}
+			kfree(scratch);
+		}
+
+		/* Inspect results and dispose of pages */
+		spin_lock_irqsave(&n->list_lock, flags);
+		list_for_each_entry_safe(page, page2, &move_list, lru) {
+			list_del(&page->lru);
+			slab_lock(page);
+			page->frozen = 0;
+
+			if (page->inuse) {
+				/*
+				 * Objects left in slab page, move it to the
+				 * tail of the partial list to increase the
+				 * chance that the freeing of the remaining
+				 * objects will free the slab page.
+				 */
+				n->nr_partial++;
+				list_add_tail(&page->lru, &n->partial);
+				slab_unlock(page);
+			} else {
+				slab_unlock(page);
+				discard_slab(s, page);
+			}
+		}
+		spin_unlock_irqrestore(&n->list_lock, flags);
+	}
+
+	return n->nr_partial;
+}
+
+/**
+ * kmem_cache_defrag() - Defrag slab caches.
+ * @node: The node to defrag or -1 for all nodes.
+ *
+ * Defrag slabs conditional on the amount of fragmentation in a page.
+ */
+int kmem_cache_defrag(int node)
+{
+	struct kmem_cache *s;
+	unsigned long left = 0;
+
+	/*
+	 * kmem_cache_defrag may be called from the reclaim path which may be
+	 * called for any page allocator alloc. So there is the danger that we
+	 * get called in a situation where slub already acquired the slub_lock
+	 * for other purposes.
+	 */
+	if (!mutex_trylock(&slab_mutex))
+		return 0;
+
+	list_for_each_entry(s, &slab_caches, list) {
+		/*
+		 * Defragmentable caches come first. If the slab cache is not
+		 * defragmentable then we can stop traversing the list.
+		 */
+		if (!s->migrate)
+			break;
+
+		if (node == -1) {
+			int nid;
+
+			for_each_node_state(nid, N_NORMAL_MEMORY)
+				if (s->node[nid]->nr_partial > MAX_PARTIAL)
+					left += __defrag(s, nid, nid, s->defrag_used_ratio);
+		} else {
+			if (s->node[node]->nr_partial > MAX_PARTIAL)
+				left += __defrag(s, node, node, s->defrag_used_ratio);
+		}
+	}
+	mutex_unlock(&slab_mutex);
+	return left;
+}
+EXPORT_SYMBOL(kmem_cache_defrag);
+
+/**
+ * __kmem_cache_shrink() - Shrink a cache.
+ * @s: The cache to shrink.
+ *
+ * Reduces the memory footprint of a slab cache by as much as possible.
+ *
+ * This works by:
+ *  1. Removing empty slabs from the partial list.
+ *  2. Migrating slab objects to denser slab pages if the slab cache
+ *  supports migration.  If not, reorganizing the partial list so that
+ *  more densely allocated slab pages come first.
+ *
+ * Not called directly, called by kmem_cache_shrink().
+ */
+int __kmem_cache_shrink(struct kmem_cache *s)
+{
+	int node;
+	int left = 0;
+
+	flush_all(s);
+	for_each_node_state(node, N_NORMAL_MEMORY)
+		left += __defrag(s, node, node, 100);
+
+	return left;
+}
+EXPORT_SYMBOL(__kmem_cache_shrink);
+
 void kmem_cache_setup_mobility(struct kmem_cache *s,
 			       kmem_cache_isolate_func isolate,
 			       kmem_cache_migrate_func migrate)