
slab: introduce the flag SLAB_MINIMIZE_WASTE

Message ID alpine.LRH.2.02.1803200954590.18995@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State Superseded, archived
Delegated to: Mike Snitzer

Commit Message

Mikulas Patocka March 20, 2018, 5:25 p.m. UTC
Hi

I'm submitting this patch for the slab allocator. The patch adds a new 
flag SLAB_MINIMIZE_WASTE. When this flag is present, the slab subsystem 
will use higher-order allocations to minimize wasted space (it will not 
use higher-order allocations when there's no benefit, such as when the 
object size is a power of two).

The reason why we need this is that we are going to merge code that does 
block device deduplication (it was developed separately and sold as a 
commercial product), and the code uses block sizes that are not a power of 
two (block sizes 192K, 448K, 640K, 832K are used in the wild). The slab 
allocator rounds up the allocation to the nearest power of two, but that 
wastes a lot of memory. Performance of the solution depends on efficient 
memory usage, so we should minimize waste as much as possible.

Higher-order allocations are unreliable (they may fail at any time if 
memory is fragmented), but that is not an issue here - the code preallocates 
a few buffers with vmalloc and then allocates buffers from the slab cache. 
If an allocation fails due to memory fragmentation, we throw away the 
contents of some existing buffer and reuse its memory, so there is no loss 
of functionality.
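
As an illustration of the intended usage (my sketch, not code from the 
patch - the cache name is made up), a driver holding 640KB deduplication 
buffers could create its cache like this once the flag exists:

        /* 640KB objects; let the allocator pick a higher order to cut waste */
        struct kmem_cache *c = kmem_cache_create("dedup_buffer_640k",
                                                 640 << 10, 0,
                                                 SLAB_MINIMIZE_WASTE, NULL);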

Mikulas


From: Mikulas Patocka <mpatocka@redhat.com>

This patch introduces a flag SLAB_MINIMIZE_WASTE for slab and slub. This
flag causes the allocation of larger (higher-order) slab pages in order to
minimize wasted space.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/slab.h  |    7 +++++++
 mm/slab.c             |    4 ++--
 mm/slab.h             |    7 ++++---
 mm/slab_common.c      |    2 +-
 mm/slub.c             |   25 ++++++++++++++++++++-----
 6 files changed, 35 insertions(+), 12 deletions(-)



Comments

Matthew Wilcox March 20, 2018, 5:35 p.m. UTC | #1
On Tue, Mar 20, 2018 at 01:25:09PM -0400, Mikulas Patocka wrote:
> The reason why we need this is that we are going to merge code that does 
> block device deduplication (it was developed separately and sold as a 
> commercial product), and the code uses block sizes that are not a power of 
> two (block sizes 192K, 448K, 640K, 832K are used in the wild). The slab 
> allocator rounds up the allocation to the nearest power of two, but that 
> wastes a lot of memory. Performance of the solution depends on efficient 
> memory usage, so we should minimize waste as much as possible.

The SLUB allocator also falls back to using the page (buddy) allocator
for allocations above 8kB, so this patch is going to have no effect on
slub.  You'd be better off using alloc_pages_exact() for this kind of
size, or managing your own pool of pages by using something like five
192k blocks in a 1MB allocation.

Christoph Lameter (Ampere) March 20, 2018, 5:54 p.m. UTC | #2
On Tue, 20 Mar 2018, Matthew Wilcox wrote:

> On Tue, Mar 20, 2018 at 01:25:09PM -0400, Mikulas Patocka wrote:
> > The reason why we need this is that we are going to merge code that does
> > block device deduplication (it was developed separately and sold as a
> > commercial product), and the code uses block sizes that are not a power of
> > two (block sizes 192K, 448K, 640K, 832K are used in the wild). The slab
> > allocator rounds up the allocation to the nearest power of two, but that
> > wastes a lot of memory. Performance of the solution depends on efficient
> > memory usage, so we should minimize waste as much as possible.
>
> The SLUB allocator also falls back to using the page (buddy) allocator
> for allocations above 8kB, so this patch is going to have no effect on
> slub.  You'd be better off using alloc_pages_exact() for this kind of
> size, or managing your own pool of pages by using something like five
> 192k blocks in a 1MB allocation.

The fallback is only effective for kmalloc caches. Manually created caches
do not follow this rule.

Note that you can already control the page orders for allocation and
the objects per slab using

	slub_min_order
	slub_max_order
	slub_min_objects

This is documented in linux/Documentation/vm/slub.txt
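
For example (values here are purely illustrative), on the kernel command
line:

	slub_max_order=4 slub_min_objects=8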

Maybe do the same thing for SLAB?


Mikulas Patocka March 20, 2018, 7:22 p.m. UTC | #3
On Tue, 20 Mar 2018, Christopher Lameter wrote:

> On Tue, 20 Mar 2018, Matthew Wilcox wrote:
> 
> > On Tue, Mar 20, 2018 at 01:25:09PM -0400, Mikulas Patocka wrote:
> > > The reason why we need this is that we are going to merge code that does
> > > block device deduplication (it was developed separately and sold as a
> > > commercial product), and the code uses block sizes that are not a power of
> > > two (block sizes 192K, 448K, 640K, 832K are used in the wild). The slab
> > > allocator rounds up the allocation to the nearest power of two, but that
> > > wastes a lot of memory. Performance of the solution depends on efficient
> > > memory usage, so we should minimize waste as much as possible.
> >
> > The SLUB allocator also falls back to using the page (buddy) allocator
> > for allocations above 8kB, so this patch is going to have no effect on
> > slub.  You'd be better off using alloc_pages_exact() for this kind of
> > size, or managing your own pool of pages by using something like five
> > 192k blocks in a 1MB allocation.
> 
> The fallback is only effective for kmalloc caches. Manually created caches
> do not follow this rule.

Yes - the dm-bufio layer uses manually created caches.

> Note that you can already control the page orders for allocation and
> the objects per slab using
> 
> 	slub_min_order
> 	slub_max_order
> 	slub_min_objects
> 
> This is documented in linux/Documentation/vm/slub.txt
> 
> Maybe do the same thing for SLAB?

Yes, but I need to change it for a specific cache, not for all caches.

When the order is greater than 3 (PAGE_ALLOC_COSTLY_ORDER), the allocation 
becomes unreliable, thus it is a bad idea to increase slub_max_order 
system-wide.

Another problem with slub_max_order is that it would pad all caches to 
slub_max_order, even those that already have a power-of-two size (in that 
case, the padding is counterproductive).

BTW. the function "order_store" in mm/slub.c modifies the structure 
kmem_cache without taking any locks - is it a bug?

Mikulas

Christoph Lameter (Ampere) March 20, 2018, 8:42 p.m. UTC | #4
On Tue, 20 Mar 2018, Mikulas Patocka wrote:

> > Maybe do the same thing for SLAB?
>
> Yes, but I need to change it for a specific cache, not for all caches.

Why only some caches?

> When the order is greater than 3 (PAGE_ALLOC_COSTLY_ORDER), the allocation
> becomes unreliable, thus it is a bad idea to increase slub_max_order
> system-wide.

Well, the allocation is more likely to fail, that is true, but SLUB will
fall back to a smaller order should the page allocator refuse to give us
that larger-sized page.

> Another problem with slub_max_order is that it would pad all caches to
> slub_max_order, even those that already have a power-of-two size (in that
> case, the padding is counterproductive).

No it does not. Slub will calculate the configuration with the least byte
wastage. It is not the standard order but the maximum order to be used.
Power of two caches below PAGE_SIZE will have order 0.

There are some corner cases where extra metadata is needed per object or
per page that will result in either object sizes that are no longer a
power of two or in page sizes smaller than the whole page. Maybe you have
a case like that? Can you show me a cache that has this issue?

> BTW. the function "order_store" in mm/slub.c modifies the structure
> kmem_cache without taking any locks - is it a bug?

The kmem_cache structure was just allocated. Only one thread can access it
thus no locking is necessary.


Mikulas Patocka March 20, 2018, 10:02 p.m. UTC | #5
On Tue, 20 Mar 2018, Christopher Lameter wrote:

> On Tue, 20 Mar 2018, Mikulas Patocka wrote:
> 
> > > Maybe do the same thing for SLAB?
> >
> > Yes, but I need to change it for a specific cache, not for all caches.
> 
> Why only some caches?

I need high order for the buffer cache that holds the deduplicated data. I 
don't need to force it system-wide.

> > When the order is greater than 3 (PAGE_ALLOC_COSTLY_ORDER), the allocation
> > becomes unreliable, thus it is a bad idea to increase slub_max_order
> > system-wide.
> 
> Well, the allocation is more likely to fail, that is true, but SLUB will
> fall back to a smaller order should the page allocator refuse to give us
> that larger-sized page.

Does SLAB have this fall-back too?

> > Another problem with slub_max_order is that it would pad all caches to
> > slub_max_order, even those that already have a power-of-two size (in that
> > case, the padding is counterproductive).
> 
> No it does not. Slub will calculate the configuration with the least byte
> wastage. It is not the standard order but the maximum order to be used.
> Power of two caches below PAGE_SIZE will have order 0.

Try to boot with slub_max_order=10 and you can see this in /proc/slabinfo:
kmalloc-8192         352    352   8192   32   64 : tunables    0    0    0 : slabdata     11     11      0
                                             ^^^^

So it rounds up power-of-two sizes to high orders unnecessarily. Without 
slub_max_order=10, the number of pages for the kmalloc-8192 cache is just 
8.
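
(Reading that slabinfo line: object size 8192, 32 objects per slab, 64 
pages per slab - i.e. an order-6, 256KB slab. A power-of-two object like 
this packs just as tightly into the default order-3, 8-page slab holding 
4 objects, so the higher order buys nothing.)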

I observe the same pathological rounding in dm-bufio caches.

> There are some corner cases where extra metadata is needed per object or
> per page that will result in either object sizes that are no longer a
> power of two or in page sizes smaller than the whole page. Maybe you have
> a case like that? Can you show me a cache that has this issue?

Here I have a patch set that changes the dm-bufio subsystem to support 
buffer sizes that are not a power of two:
http://people.redhat.com/~mpatocka/patches/kernel/dm-bufio-arbitrary-sector-size/

I need to change the slub cache to minimize wasted space - i.e. when 
asking for a slab cache for 640kB objects, the slub system currently 
allocates 1MB per object and 384kB is wasted. This is the reason why I'm 
making this patch.

> > BTW. the function "order_store" in mm/slub.c modifies the structure
> > kmem_cache without taking any locks - is it a bug?
> 
> The kmem_cache structure was just allocated. Only one thread can access it
> thus no locking is necessary.

No - order_store is called when writing to /sys/kernel/slab/<cache>/order 
- you can modify order for any existing cache - and the modification 
happens without any locking.
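
(For example, "echo 5 > /sys/kernel/slab/kmalloc-8192/order" takes effect 
on a live cache.)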

Mikulas

Christoph Lameter (Ampere) March 21, 2018, 3:35 p.m. UTC | #6
On Tue, 20 Mar 2018, Mikulas Patocka wrote:

> > > Another problem with slub_max_order is that it would pad all caches to
> > > slub_max_order, even those that already have a power-of-two size (in that
> > > case, the padding is counterproductive).
> >
> > No it does not. Slub will calculate the configuration with the least byte
> > wastage. It is not the standard order but the maximum order to be used.
> > Power of two caches below PAGE_SIZE will have order 0.
>
> Try to boot with slub_max_order=10 and you can see this in /proc/slabinfo:
> kmalloc-8192         352    352   8192   32   64 : tunables    0    0    0 : slabdata     11     11      0

Yes, it tries to create a slab size that will accommodate the minimum
number of objects per slab.

> So it rounds up power-of-two sizes to high orders unnecessarily. Without
> slub_max_order=10, the number of pages for the kmalloc-8192 cache is just
> 8.

The kmalloc-8192 has 4 objects per slab on my system which means an
allocation size of 32k = order 3.

In this case 4 objects fit tightly into a slab. There is no waste.

But then I thought you were talking about manually created slabs not
about the kmalloc array?

> I observe the same pathological rounding in dm-bufio caches.
>
> > There are some corner cases where extra metadata is needed per object or
> > per page that will result in either object sizes that are no longer a
> > power of two or in page sizes smaller than the whole page. Maybe you have
> > a case like that? Can you show me a cache that has this issue?
>
> Here I have a patch set that changes the dm-bufio subsystem to support
> buffer sizes that are not a power of two:
> http://people.redhat.com/~mpatocka/patches/kernel/dm-bufio-arbitrary-sector-size/
>
> I need to change the slub cache to minimize wasted space - i.e. when
> asking for a slab cache for 640kB objects, the slub system currently
> allocates 1MB per object and 384kB is wasted. This is the reason why I'm
> making this patch.

You should not be using the slab allocators for these. Allocate higher
order pages or numbers of consecutive smaller pages from the page
allocator. The slab allocators are written for objects smaller than page
size.

> > > BTW. the function "order_store" in mm/slub.c modifies the structure
> > > kmem_cache without taking any locks - is it a bug?
> >
> > The kmem_cache structure was just allocated. Only one thread can access it
> > thus no locking is necessary.
>
> No - order_store is called when writing to /sys/kernel/slab/<cache>/order
> - you can modify order for any existing cache - and the modification
> happens without any locking.

Well, it still does not matter. The order of slab pages can be dynamic
even within a slab cache. You can have pages of varying sizes.

What kind of problem could be caused here?

Mikulas Patocka March 21, 2018, 4:25 p.m. UTC | #7
On Wed, 21 Mar 2018, Christopher Lameter wrote:

> On Tue, 20 Mar 2018, Mikulas Patocka wrote:
> 
> > > > Another problem with slub_max_order is that it would pad all caches to
> > > > slub_max_order, even those that already have a power-of-two size (in that
> > > > case, the padding is counterproductive).
> > >
> > > No it does not. Slub will calculate the configuration with the least byte
> > > wastage. It is not the standard order but the maximum order to be used.
> > > Power of two caches below PAGE_SIZE will have order 0.
> >
> > Try to boot with slub_max_order=10 and you can see this in /proc/slabinfo:
> > kmalloc-8192         352    352   8192   32   64 : tunables    0    0    0 : slabdata     11     11      0
> 
> Yes, it tries to create a slab size that will accommodate the minimum
> number of objects per slab.
> 
> > So it rounds up power-of-two sizes to high orders unnecessarily. Without
> > slub_max_order=10, the number of pages for the kmalloc-8192 cache is just
> > 8.
> 
> The kmalloc-8192 has 4 objects per slab on my system which means an
> allocation size of 32k = order 3.
> 
> In this case 4 objects fit tightly into a slab. There is no waste.
> 
> But then I thought you were talking about manually created slabs not
> about the kmalloc array?

For some workloads, dm-bufio needs caches with sizes that are a power of 
two (the majority of workloads fall into this category). For other workloads 
dm-bufio needs caches with sizes that are not a power of two.

Now - we don't want higher-order allocations for power-of-two caches 
(because higher-order allocations just cause memory fragmentation without 
any benefit), but we want higher-order allocations for non-power-of-two 
caches (because higher-order allocations minimize wasted space).

For example:
for 192K block size, the ideal allocation size is 4MB (it holds 21 blocks)
for 448K block size, the ideal allocation size is 4MB (it holds 9 blocks)
for 512K block size, the ideal allocation size is 512KB (there is no benefit 
	from using a higher order)
for 640K block size, the ideal allocation size is 2MB (it holds 3 blocks; 
	increasing the allocation size to 4MB doesn't result in any benefit)
for 832K block size, the ideal allocation size is 1MB (it holds 1 block; 
	increasing the allocation to 2MB or 4MB doesn't result in any benefit)
for 1M block size, the ideal allocation size is 1MB

The problem with "slub_max_order" is that it increases the order either 
always or never, but doesn't have the capability to calculate the ideal 
order for a given object size. The patch that I sent does exactly this 
calculation - see the sketch below.
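
To make the rule concrete, here is my illustration of that calculation 
(not the code from the patch): raise the order only while the waste per 
object strictly improves. It reproduces the examples above (192K -> 4MB, 
512K -> 512K, 640K -> 2MB, 832K -> 1MB):

        /* Sketch only; assumes a kernel context (PAGE_SIZE, get_order). */
        static unsigned int ideal_slab_order(unsigned long object_size,
                                             unsigned int max_order)
        {
                unsigned int order, best = get_order(object_size);
                unsigned long best_waste = (PAGE_SIZE << best) % object_size;
                unsigned long best_objects = (PAGE_SIZE << best) / object_size;

                for (order = best + 1; order <= max_order; order++) {
                        unsigned long slab = PAGE_SIZE << order;
                        unsigned long objects = slab / object_size;
                        unsigned long waste = slab % object_size;

                        /* is waste/objects < best_waste/best_objects? */
                        if (waste * best_objects < best_waste * objects) {
                                best = order;
                                best_waste = waste;
                                best_objects = objects;
                        }
                }
                return best;
        }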

Another problem with "slub_max_order" is that a device driver that needs 
to create a slab cache cannot really set it - the device driver can't 
modify the kernel parameters.

> > I observe the same pathological rounding in dm-bufio caches.
> >
> > > There are some corner cases where extra metadata is needed per object or
> > > per page that will result in either object sizes that are no longer a
> > > power of two or in page sizes smaller than the whole page. Maybe you have
> > > a case like that? Can you show me a cache that has this issue?
> >
> > Here I have a patch set that changes the dm-bufio subsystem to support
> > buffer sizes that are not a power of two:
> > http://people.redhat.com/~mpatocka/patches/kernel/dm-bufio-arbitrary-sector-size/
> >
> > I need to change the slub cache to minimize wasted space - i.e. when
> > asking for a slab cache for 640kB objects, the slub system currently
> > allocates 1MB per object and 384kB is wasted. This is the reason why I'm
> > making this patch.
> 
> You should not be using the slab allocators for these. Allocate higher
> order pages or numbers of consecutive smaller pages from the page
> allocator. The slab allocators are written for objects smaller than page
> size.

So, do you argue that I need to write my own slab cache functionality 
instead of using the existing slab code?

I can do it - but duplicating code is a bad thing.

> > > > BTW. the function "order_store" in mm/slub.c modifies the structure
> > > > kmem_cache without taking any locks - is it a bug?
> > >
> > > The kmem_cache structure was just allocated. Only one thread can access it
> > > thus no locking is necessary.
> >
> > No - order_store is called when writing to /sys/kernel/slab/<cache>/order
> > - you can modify order for any existing cache - and the modification
> > happens without any locking.
> 
> Well, it still does not matter. The order of slab pages can be dynamic
> even within a slab cache. You can have pages of varying sizes.
> 
> What kind of problem could be caused here?

Unlocked accesses are generally considered bad. For example, see this 
piece of code in calculate_sizes:
        s->allocflags = 0;
        if (order)
                s->allocflags |= __GFP_COMP;

        if (s->flags & SLAB_CACHE_DMA)
                s->allocflags |= GFP_DMA;

        if (s->flags & SLAB_RECLAIM_ACCOUNT)
                s->allocflags |= __GFP_RECLAIMABLE;

If you are running this while the cache is in use (i.e. when the user 
writes /sys/kernel/slab/<cache>/order), then other processes will see 
invalid s->allocflags for a short time.

Mikulas

Matthew Wilcox March 21, 2018, 5:10 p.m. UTC | #8
On Wed, Mar 21, 2018 at 12:25:39PM -0400, Mikulas Patocka wrote:
> Now - we don't want higher-order allocations for power-of-two caches 
> (because higher-order allocations just cause memory fragmentation without 
> any benefit)

Higher-order allocations don't cause memory fragmentation.  Indeed,
they avoid it.  They do fail as a result of fragmentation, which is
probably what you meant.

> , but we want higher-order allocations for non-power-of-two 
> caches (because higher-order allocations minimize wasted space).
> 
> For example:
> for 192K block size, the ideal order is 4MB (it takes 21 blocks)

I wonder if that's true.  You can get five blocks into 1MB, wasting 64kB.
So going up by two allocation orders lets you get an extra block in
at the cost of failing more frequently.
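
(The arithmetic: 5 x 192K = 960K in 1MB wastes 64kB, about 12.8kB per 
block; 21 x 192K = 4032K in 4MB also wastes 64kB, but only about 3kB per 
block. The total waste per slab is the same - the gain is density.)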

> > You should not be using the slab allocators for these. Allocate higher
> > order pages or numbers of consecutive smaller pages from the page
> > allocator. The slab allocators are written for objects smaller than page
> > size.
> 
> So, do you argue that I need to write my own slab cache functionality 
> instead of using the existing slab code?
> 
> I can do it - but duplicating code is a bad thing.

It is -- but writing a special-purpose allocator can be better than making
a general purpose allocator also solve a special purpose.  I don't know
whether that's true here or not.

Your allocator seems like it could be remarkably simple; you know
you're always doing high-order allocations, and you know that you're
never allocating more than a handful of blocks from a page allocation.
So you can probably store all of your metadata in the struct page
(because your metadata is basically a bitmap) and significantly save on
memory usage.  The one downside I see is that you don't get the reporting
through /proc/slabinfo.
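
A hedged sketch of that idea (every name here is invented for illustration; 
locking and the allocation of new runs are omitted):

        /* Carve one high-order page run into fixed-size blocks; track
         * which blocks are used with a bitmap kept in page->private
         * (works while blocks_per_run <= BITS_PER_LONG). */
        static void *run_alloc_block(struct page *page,
                                     unsigned int block_size,
                                     unsigned int blocks_per_run)
        {
                unsigned long map = page_private(page);
                unsigned int i;

                for (i = 0; i < blocks_per_run; i++) {
                        if (!(map & (1UL << i))) {
                                set_page_private(page, map | (1UL << i));
                                return page_address(page) + i * block_size;
                        }
                }
                return NULL;    /* run is full; try another run */
        }

A fresh run would come from alloc_pages(gfp | __GFP_COMP, order), with 
page_private initialized to zero.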

So, is this an area where slub should be improved, or is this a case where
writing a special-purpose allocator makes more sense?  It seems like you
already have a special-purpose allocator, in that you know how to fall
back to vmalloc if slab-alloc fails.  So maybe have your own allocator
that interfaces to the page allocator for now; keep its interface nice
and clean, and maybe it'll get pulled out of your driver and put into mm/
some day if it becomes a useful API for everybody to share?

Christoph Lameter (Ampere) March 21, 2018, 5:30 p.m. UTC | #9
On Wed, 21 Mar 2018, Mikulas Patocka wrote:

> > You should not be using the slab allocators for these. Allocate higher
> > order pages or numbers of consecutive smaller pages from the page
> > allocator. The slab allocators are written for objects smaller than page
> > size.
>
> So, do you argue that I need to write my own slab cache functionality
> instead of using the existing slab code?

Just use the existing page allocator calls to allocate and free the
memory you need.

> I can do it - but duplicating code is a bad thing.

There is no need to duplicate anything. There is lots of infrastructure
already in the kernel. You just need to use the right allocation / freeing
calls.

> > What kind of problem could be caused here?
>
> Unlocked accesses are generally considered bad. For example, see this
> piece of code in calculate_sizes:
>         s->allocflags = 0;
>         if (order)
>                 s->allocflags |= __GFP_COMP;
>
>         if (s->flags & SLAB_CACHE_DMA)
>                 s->allocflags |= GFP_DMA;
>
>         if (s->flags & SLAB_RECLAIM_ACCOUNT)
>                 s->allocflags |= __GFP_RECLAIMABLE;
>
> If you are running this while the cache is in use (i.e. when the user
> writes /sys/kernel/slab/<cache>/order), then other processes will see
> invalid s->allocflags for a short time.

Calculating sizes is done when the slab has only a single accessor. Thus
no locking is needed.

Changing the size of objects in a slab cache when there is already a set
of objects allocated and under management by the slab cache would
cause the allocator to fail and lead to garbled data.

Christoph Lameter (Ampere) March 21, 2018, 5:39 p.m. UTC | #10
One other thought: If you want to improve the behavior for large scale
objects allocated through kmalloc/kmemcache then we would certainly be
glad to entertain those ideas.

F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not
allocate powers of two pages. It would be relatively easy to make
kmalloc_large round the allocation to the next page size and then allocate
N consecutive pages via alloc_pages_exact() and free the remainder unused
pages or some such thing.
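
Since alloc_pages_exact() already allocates a power-of-two run internally 
and frees the unused tail pages, a sketch of the idea is short (my 
illustration, not an actual patch; the function name is made up):

        void *kmalloc_large_exact(size_t size, gfp_t flags)
        {
                /* allocates ceil(size/PAGE_SIZE) pages, frees the tail */
                return alloc_pages_exact(PAGE_ALIGN(size), flags);
        }

with free_pages_exact() as the matching release path.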

Matthew Wilcox March 21, 2018, 5:49 p.m. UTC | #11
On Wed, Mar 21, 2018 at 12:39:33PM -0500, Christopher Lameter wrote:
> One other thought: If you want to improve the behavior for large scale
> objects allocated through kmalloc/kmemcache then we would certainly be
> glad to entertain those ideas.
> 
> F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not
> allocate powers of two pages. It would be relatively easy to make
> kmalloc_large round the allocation to the next page size and then allocate
> N consecutive pages via alloc_pages_exact() and free the remainder unused
> pages or some such thing.

I don't know if that's a good idea.  That will contribute to fragmentation
if the allocation is held onto for a short-to-medium length of time.
If the allocation is for a very long period of time then those pages
would have been unavailable anyway, but if the user of the tail pages
holds them beyond the lifetime of the large allocation, then this is
probably a bad tradeoff to make.

I do see Mikulas' use case as interesting, I just don't know whether it's
worth changing slab/slub to support it.  At first blush, other than the
sheer size of the allocations, it's a good fit.

Christoph Lameter (Ampere) March 21, 2018, 6:01 p.m. UTC | #12
On Wed, 21 Mar 2018, Matthew Wilcox wrote:

> I don't know if that's a good idea.  That will contribute to fragmentation
> if the allocation is held onto for a short-to-medium length of time.
> If the allocation is for a very long period of time then those pages
> would have been unavailable anyway, but if the user of the tail pages
> holds them beyond the lifetime of the large allocation, then this is
> probably a bad tradeoff to make.
>
> I do see Mikulas' use case as interesting, I just don't know whether it's
> worth changing slab/slub to support it.  At first blush, other than the
> sheer size of the allocations, it's a good fit.

Well there are numerous page pool approaches already in the kernel. Maybe
there is a better fit with those?

Mikulas Patocka March 21, 2018, 6:23 p.m. UTC | #13
On Wed, 21 Mar 2018, Matthew Wilcox wrote:

> On Wed, Mar 21, 2018 at 12:39:33PM -0500, Christopher Lameter wrote:
> > One other thought: If you want to improve the behavior for large scale
> > objects allocated through kmalloc/kmemcache then we would certainly be
> > glad to entertain those ideas.
> > 
> > F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not
> > allocate powers of two pages. It would be relatively easy to make
> > kmalloc_large round the allocation to the next page size and then allocate
> > N consecutive pages via alloc_pages_exact() and free the remainder unused
> > pages or some such thing.

alloc_pages_exact() has O(n*log n) complexity with respect to the number 
of requested pages. It would have to be reworked and optimized if it were 
to be used for the dm-bufio cache. (It could be optimized down to O(log n) 
if it didn't split the compound page into a lot of separate pages, but 
split it into power-of-two clusters instead.)

> I don't know if that's a good idea.  That will contribute to fragmentation
> if the allocation is held onto for a short-to-medium length of time.
> If the allocation is for a very long period of time then those pages
> would have been unavailable anyway, but if the user of the tail pages
> holds them beyond the lifetime of the large allocation, then this is
> probably a bad tradeoff to make.

The problem with alloc_pages_exact() is that it exhausts all the 
high-order pages and leaves many free low-order pages around. So you'll 
end up in a system with a lot of free memory, but with all high-order 
pages missing. As there would be a lot of free memory, the kswapd thread 
would not be woken up to free some high-order pages.

I think that using slab with high order is better, because it at least 
doesn't leave many low-order pages behind.

> I do see Mikulas' use case as interesting, I just don't know whether it's
> worth changing slab/slub to support it.  At first blush, other than the
> sheer size of the allocations, it's a good fit.

All I need is to increase the order of a specific slab cache - I think 
it's better to implement an interface that allows doing it than to 
duplicate the slab cache code.

BTW. it could be possible to open the file 
"/sys/kernel/slab/<cache>/order" from the dm-bufio kernel driver and write 
the requested value there, but it seems very dirty. It would be better to 
have a kernel interface for that.

Mikulas

Mikulas Patocka March 21, 2018, 6:36 p.m. UTC | #14
On Wed, 21 Mar 2018, Christopher Lameter wrote:

> On Wed, 21 Mar 2018, Mikulas Patocka wrote:
> 
> > > You should not be using the slab allocators for these. Allocate higher
> > > order pages or numbers of consecutive smaller pages from the page
> > > allocator. The slab allocators are written for objects smaller than page
> > > size.
> >
> > So, do you argue that I need to write my own slab cache functionality
> > instead of using the existing slab code?
> 
> Just use the existing page allocator calls to allocate and free the
> memory you need.
> 
> > I can do it - but duplicating code is a bad thing.
> 
> There is no need to duplicate anything. There is lots of infrastructure
> already in the kernel. You just need to use the right allocation / freeing
> calls.

So, what would you recommend for allocating 640KB objects while minimizing 
wasted space?
* alloc_pages - rounds up to the next power of two
* kmalloc - rounds up to the next power of two
* alloc_pages_exact - O(n*log n) complexity; and causes memory 
  fragmentation if used excessively
* vmalloc - horrible performance (modifies page tables and that causes 
  synchronization across all CPUs)

anything else?

The slab cache with large order seems the best choice for this.

> > > What kind of problem could be caused here?
> >
> > Unlocked accesses are generally considered bad. For example, see this
> > piece of code in calculate_sizes:
> >         s->allocflags = 0;
> >         if (order)
> >                 s->allocflags |= __GFP_COMP;
> >
> >         if (s->flags & SLAB_CACHE_DMA)
> >                 s->allocflags |= GFP_DMA;
> >
> >         if (s->flags & SLAB_RECLAIM_ACCOUNT)
> >                 s->allocflags |= __GFP_RECLAIMABLE;
> >
> > If you are running this while the cache is in use (i.e. when the user
> > writes /sys/kernel/slab/<cache>/order), then other processes will see
> > invalid s->allocflags for a short time.
> 
> Calculating sizes is done when the slab has only a single accessor. Thus
> no locking is needed.

The calculation is done whenever someone writes to 
"/sys/kernel/slab/*/order"

And you can obviously write to that file while the slab cache is in use. Try
it.

So, the function calculate_sizes can actually race with allocation from 
the slab cache.

> Changing the size of objects in a slab cache when there is already a set
> of objects allocated and under management by the slab cache would
> cause the allocator to fail and lead to garbled data.

I am not talking about changing the size of objects in a slab cache. I am 
talking about changing the allocation order of a slab cache while the 
cache is in use. This can be done with the sysfs interface.

Mikulas

Christoph Lameter (Ampere) March 21, 2018, 6:40 p.m. UTC | #15
On Wed, 21 Mar 2018, Mikulas Patocka wrote:

> > > F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not
> > > allocate powers of two pages. It would be relatively easy to make
> > > kmalloc_large round the allocation to the next page size and then allocate
> > > N consecutive pages via alloc_pages_exact() and free the remainder unused
> > > pages or some such thing.
>
> alloc_pages_exact() has O(n*log n) complexity with respect to the number
> of requested pages. It would have to be reworked and optimized if it were
> to be used for the dm-bufio cache. (It could be optimized down to O(log n)
> if it didn't split the compound page into a lot of separate pages, but
> split it into power-of-two clusters instead.)

Well then a memory pool of page allocator requests may address that issue?

Have a look at include/linux/mempool.h.

> > I don't know if that's a good idea.  That will contribute to fragmentation
> > if the allocation is held onto for a short-to-medium length of time.
> > If the allocation is for a very long period of time then those pages
> > would have been unavailable anyway, but if the user of the tail pages
> > holds them beyond the lifetime of the large allocation, then this is
> > probably a bad tradeoff to make.

Fragmentation is sadly a big issue. You could create a mempool on bootup
or early after boot to ensure that you have a sufficient number of
contiguous pages available.

> The problem with alloc_pages_exact() is that it exhausts all the
> high-order pages and leaves many free low-order pages around. So you'll
> end up in a system with a lot of free memory, but with all high-order
> pages missing. As there would be a lot of free memory, the kswapd thread
> would not be woken up to free some high-order pages.

I think that logic is properly balanced and will take into account pages
that have been removed from the LRU expiration logic.

> I think that using slab with high order is better, because it at least
> doesn't leave many low-order pages behind.

Any request to the slab via kmalloc with a size > 2x page size will simply
lead to a page allocator request. You have the same issue. If you want to
rely on the slab allocator buffering large segments for you then a mempool
will also solve the issue for you and you have more control over the pool.

> BTW. it could be possible to open the file
> "/sys/kernel/slab/<cache>/order" from the dm-bufio kernel driver and write
> the requested value there, but it seems very dirty. It would be better to
> have a kernel interface for that.

Hehehe you could directly write to the kmem_cache structure and increase
the order. AFAICT this would be dirty but work.

But still the increased page order will get you into trouble with
fragmentation when the system runs for a long time. That is the reason we
try to limit the allocation sizes coming from the slab allocator.

Mikulas Patocka March 21, 2018, 6:55 p.m. UTC | #16
On Wed, 21 Mar 2018, Christopher Lameter wrote:

> On Wed, 21 Mar 2018, Mikulas Patocka wrote:
> 
> > > > F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not
> > > > allocate powers of two pages. It would be relatively easy to make
> > > > kmalloc_large round the allocation to the next page size and then allocate
> > > > N consecutive pages via alloc_pages_exact() and free the remainder unused
> > > > pages or some such thing.
> >
> > alloc_pages_exact() has O(n*log n) complexity with respect to the number
> > of requested pages. It would have to be reworked and optimized if it were
> > to be used for the dm-bufio cache. (It could be optimized down to O(log n)
> > if it didn't split the compound page into a lot of separate pages, but
> > split it into power-of-two clusters instead.)
> 
> Well then a memory pool of page allocator requests may address that issue?
> 
> Have a look at include/linux/mempool.h.

I know the mempool interface. Mempool can keep some amount of reserved 
objects available for when system memory is exhausted.

Mempool doesn't deal with object allocation at all - mempool needs to be 
hooked to an existing object allocator (slab cache, kmalloc, alloc_pages, 
or some custom allocator provided with the methods mempool_alloc_t and 
mempool_free_t).
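
(For instance, a page-backed pool already exists - a sketch:

        int order = 8;                  /* e.g. 1MB runs with 4KB pages */
        mempool_t *pool = mempool_create_page_pool(4, order);
        struct page *page = mempool_alloc(pool, GFP_NOIO);
        /* ... use the order-8 page run ... */
        mempool_free(page, pool);

but the backing allocation is still a plain order-N page allocation; the 
pool only adds a reserve.)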

> > > I don't know if that's a good idea.  That will contribute to fragmentation
> > > if the allocation is held onto for a short-to-medium length of time.
> > > If the allocation is for a very long period of time then those pages
> > > would have been unavailable anyway, but if the user of the tail pages
> > > holds them beyond the lifetime of the large allocation, then this is
> > > probably a bad tradeoff to make.
> 
> Fragmentation is sadly a big issue. You could create a mempool on bootup
> or early after boot to ensure that you have a sufficient number of
> contiguous pages available.

The dm-bufio driver deals correctly with this - it preallocates several 
buffers with vmalloc when the dm-bufio cache is created. During operation, 
if a high-order allocation fails, the dm-bufio subsystem throws away some 
existing buffer and reuses the already allocated chunk of memory for the 
buffer that needs to be created.

So, fragmentation is not an issue with this use case. dm-bufio can make 
forward progress even if memory is totally exhausted.
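
In pseudo-C, that fallback looks like this (the helper name is hypothetical, 
for illustration only):

        data = kmem_cache_alloc(cache, GFP_NOIO | __GFP_NORETRY | __GFP_NOWARN);
        if (!data)
                data = reuse_existing_buffer(client);   /* hypothetical: evict
                                                           and recycle a buffer */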

> > The problem with alloc_pages_exact() is that it exhausts all the
> > high-order pages and leaves many free low-order pages around. So you'll
> > end up in a system with a lot of free memory, but with all high-order
> > pages missing. As there would be a lot of free memory, the kswapd thread
> > would not be woken up to free some high-order pages.
> 
> I think that logic is properly balanced and will take into account pages
> that have been removed from the LRU expiration logic.
> 
> > I think that using slab with high order is better, because it at least
> > doesn't leave many low-order pages behind.
> 
> Any request to the slab via kmalloc with a size > 2x page size will simply
> lead to a page allocator request. You have the same issue. If you want to
> rely on the slab allocator buffering large segments for you then a mempool
> will also solve the issue for you and you have more control over the pool.

mempool solves nothing because it needs a backing allocator. And the 
question is what this backing allocator should be.

> > BTW. it could be possible to open the file
> > "/sys/kernel/slab/<cache>/order" from the dm-bufio kernel driver and write
> > the requested value there, but it seems very dirty. It would be better to
> > have a kernel interface for that.
> 
> Hehehe you could directly write to the kmem_cache structure and increase
> the order. AFAICT this would be dirty but work.
> 
> But still the increased page order will get you into trouble with
> fragmentation when the system runs for a long time. That is the reason we
> try to limit the allocation sizes coming from the slab allocator.

It won't - see above - if the high-order allocation fails, dm-bufio just 
reuses some existing buffer.

Mikulas

Matthew Wilcox March 21, 2018, 6:55 p.m. UTC | #17
On Wed, Mar 21, 2018 at 01:40:31PM -0500, Christopher Lameter wrote:
> On Wed, 21 Mar 2018, Mikulas Patocka wrote:
> 
> > > > F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not
> > > > allocate powers of two pages. It would be relatively easy to make
> > > > kmalloc_large round the allocation to the next page size and then allocate
> > > > N consecutive pages via alloc_pages_exact() and free the remainder unused
> > > > pages or some such thing.
> >
> > alloc_pages_exact() has O(n*log n) complexity with respect to the number
> > of requested pages. It would have to be reworked and optimized if it were
> > to be used for the dm-bufio cache. (It could be optimized down to O(log n)
> > if it didn't split the compound page into a lot of separate pages, but
> > split it into power-of-two clusters instead.)
> 
> Well then a memory pool of page allocator requests may address that issue?
> 
> Have a look at include/linux/mempool.h.

That's not what mempool is for.  mempool is a cache of elements that were
allocated from slab in the first place.  (OK, technically, you don't have
to use slab as the allocator, but since there is no allocator that solves
this problem, mempool doesn't solve the problem either!)

> > BTW. it could be possible to open the file
> > "/sys/kernel/slab/<cache>/order" from the dm-bufio kernel driver and write
> > the requested value there, but it seems very dirty. It would be better to
> > have a kernel interface for that.
> 
> Hehehe you could directly write to the kmem_cache structure and increase
> the order. AFAICT this would be dirty but work.
> 
> But still the increased page order will get you into trouble with
> fragmentation when the system runs for a long time. That is the reason we
> try to limit the allocation sizes coming from the slab allocator.

Right; he has a fallback already (vmalloc).  So ... let's just add the
interface to allow slab caches to have their order tuned by users who
really know what they're doing?

Christoph Lameter (Ampere) March 21, 2018, 6:57 p.m. UTC | #18
On Wed, 21 Mar 2018, Mikulas Patocka wrote:

> So, what would you recommend for allocating 640KB objects while minimizing
> wasted space?
> * alloc_pages - rounds up to the next power of two
> * kmalloc - rounds up to the next power of two
> * alloc_pages_exact - O(n*log n) complexity; and causes memory
>   fragmentation if used excessively
> * vmalloc - horrible performance (modifies page tables and that causes
>   synchronization across all CPUs)
>
> anything else?

Need to find it but there is a way to allocate N pages in sequence
somewhere. Otherwise mempools are something that would work.

> > > > What kind of problem could be caused here?
> > >
> > > Unlocked accesses are generally considered bad. For example, see this
> > > piece of code in calculate_sizes:
> > >         s->allocflags = 0;
> > >         if (order)
> > >                 s->allocflags |= __GFP_COMP;
> > >
> > >         if (s->flags & SLAB_CACHE_DMA)
> > >                 s->allocflags |= GFP_DMA;
> > >
> > >         if (s->flags & SLAB_RECLAIM_ACCOUNT)
> > >                 s->allocflags |= __GFP_RECLAIMABLE;
> > >
> > > If you are running this while the cache is in use (i.e. when the user
> > > writes /sys/kernel/slab/<cache>/order), then other processes will see
> > > invalid s->allocflags for a short time.
> >
> > Calculating sizes is done when the slab has only a single accessor. Thus
> > no locking is needed.
>
> The calculation is done whenever someone writes to
> "/sys/kernel/slab/*/order"

But the flags you are mentioning do not change and the size of the object
does not change. What changes is the number of objects in the slab page.

> And you can obviously write to that file while the slab cache is in use. Try
> it.

You cannot change flags that would impact the size of the objects. I only
allowed changing characteristics that do not impact object size.

> I am not talking about changing the size of objects in a slab cache. I am
> talking about changing the allocation order of a slab cache while the
> cache is in use. This can be done with the sysfs interface.

But then that is something that is allowed but does not affect the object
size used by the slab allocators.


Christoph Lameter (Ampere) March 21, 2018, 6:58 p.m. UTC | #19
On Wed, 21 Mar 2018, Matthew Wilcox wrote:

> > Have a look at include/linux/mempool.h.
>
> That's not what mempool is for.  mempool is a cache of elements that were
> allocated from slab in the first place.  (OK, technically, you don't have
> to use slab as the allocator, but since there is no allocator that solves
> this problem, mempool doesn't solve the problem either!)

You can put the page allocator in there instead of a slab allocator.

> > But still the increased page order will get you into trouble with
> > fragmentation when the system runs for a long time. That is the reason we
> > try to limit the allocation sizes coming from the slab allocator.
>
> Right; he has a fallback already (vmalloc).  So ... let's just add the
> interface to allow slab caches to have their order tuned by users who
> really know what they're doing?

Ok, that's trivial.

Mikulas Patocka March 21, 2018, 7:19 p.m. UTC | #20
On Wed, 21 Mar 2018, Christopher Lameter wrote:

> On Wed, 21 Mar 2018, Mikulas Patocka wrote:
> 
> > So, what would you recommend for allocating 640KB objects while minimizing
> > wasted space?
> > * alloc_pages - rounds up to the next power of two
> > * kmalloc - rounds up to the next power of two
> > * alloc_pages_exact - O(n*log n) complexity; and causes memory
> >   fragmentation if used excessively
> > * vmalloc - horrible performance (modifies page tables and that causes
> >   synchronization across all CPUs)
> >
> > anything else?
> 
> Need to find it but there is a way to allocate N pages in sequence
> somewhere. Otherwise mempools are something that would work.

There's also the contiguous memory allocator (CMA), but it needs its memory 
to be reserved at boot time. It is intended for misdesigned hardware devices 
that need contiguous memory for DMA. As it's intended for one-time 
allocations when loading drivers, it lacks the performance and scalability 
of the slab cache and alloc_pages.

> > > > > What kind of problem could be caused here?
> > > >
> > > > Unlocked accesses are generally considered bad. For example, see this
> > > > piece of code in calculate_sizes:
> > > >         s->allocflags = 0;
> > > >         if (order)
> > > >                 s->allocflags |= __GFP_COMP;
> > > >
> > > >         if (s->flags & SLAB_CACHE_DMA)
> > > >                 s->allocflags |= GFP_DMA;
> > > >
> > > >         if (s->flags & SLAB_RECLAIM_ACCOUNT)
> > > >                 s->allocflags |= __GFP_RECLAIMABLE;
> > > >
> > > > If you are running this while the cache is in use (i.e. when the user
> > > > writes /sys/kernel/slab/<cache>/order), then other processes will see
> > > > invalid s->allocflags for a short time.
> > >
> > > Calculating sizes is done when the slab has only a single accessor. Thus
> > > no locking is needed.
> >
> > The calculation is done whenever someone writes to
> > "/sys/kernel/slab/*/order"
> 
> But the flags you are mentioning do not change and the size of the object
> does not change. What changes is the number of objects in the slab page.

See this code again:
> > >         s->allocflags = 0;
> > >         if (order)
> > >                 s->allocflags |= __GFP_COMP;
> > >
> > >         if (s->flags & SLAB_CACHE_DMA)
> > >                 s->allocflags |= GFP_DMA;
> > >
> > >         if (s->flags & SLAB_RECLAIM_ACCOUNT)
> > >                 s->allocflags |= __GFP_RECLAIMABLE;
When this function is called, the value of s->allocflags does change. At 
the end, s->allocflags holds the same value as before, but it changes 
temporarily.

For example, if someone creates a slab cache with the flag SLAB_CACHE_DMA, 
and an allocation of an object from this cache races with a user writing to 
/sys/kernel/slab/<cache>/order, then the allocator can, for a short period 
of time, see "s->allocflags == 0" and allocate a non-DMA page. That is a 
bug.
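
(A conventional way to avoid the transient state - my sketch, not an 
existing patch - is to build the flags in a local variable and publish 
them with a single store:

        gfp_t allocflags = 0;

        if (order)
                allocflags |= __GFP_COMP;
        if (s->flags & SLAB_CACHE_DMA)
                allocflags |= GFP_DMA;
        if (s->flags & SLAB_RECLAIM_ACCOUNT)
                allocflags |= __GFP_RECLAIMABLE;
        WRITE_ONCE(s->allocflags, allocflags);

though that alone does not make concurrent order changes safe.)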

Mikulas

Mikulas Patocka March 21, 2018, 7:25 p.m. UTC | #21
On Wed, 21 Mar 2018, Christopher Lameter wrote:

> One other thought: If you want to improve the behavior for large scale
> objects allocated through kmalloc/kmemcache then we would certainly be
> glad to entertain those ideas.
> 
> F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not
> allocate powers of two pages. It would be relatively easy to make
> kmalloc_large round the allocation to the next page size and then allocate
> N consecutive pages via alloc_pages_exact() and free the remainder unused
> pages or some such thing.

It may be possible, but we'd need to improve the horrible complexity of 
alloc_pages_exact().

This is a trade-off between performance and waste. A power-of-two 
allocation can be done quickly, but it wastes a lot of space. 
alloc_pages_exact() wastes less space, but it is slow.

The question is - how many of these large-kmalloc allocations are 
short-lived and how many are long-lived? I don't know, I haven't measured 
it.

Mikulas

Vlastimil Babka April 13, 2018, 9:22 a.m. UTC | #22
On 03/21/2018 07:36 PM, Mikulas Patocka wrote:
> 
> 
> On Wed, 21 Mar 2018, Christopher Lameter wrote:
> 
>> On Wed, 21 Mar 2018, Mikulas Patocka wrote:
>>
>>>> You should not be using the slab allocators for these. Allocate higher
>>>> order pages or numbers of consecutive smaller pages from the page
>>>> allocator. The slab allocators are written for objects smaller than page
>>>> size.
>>>
>>> So, do you argue that I need to write my own slab cache functionality
>>> instead of using the existing slab code?
>>
>> Just use the existing page allocator calls to allocate and free the
>> memory you need.
>>
>>> I can do it - but duplicating code is a bad thing.
>>
>> There is no need to duplicate anything. There is lots of infrastructure
>> already in the kernel. You just need to use the right allocation / freeing
>> calls.
> 
> So, what would you recommend for allocating 640KB objects while minimizing 
> wasted space?
> * alloc_pages - rounds up to the next power of two
> * kmalloc - rounds up to the next power of two
> * alloc_pages_exact - O(n*log n) complexity; and causes memory 
>   fragmentation if used excessively
> * vmalloc - horrible performance (modifies page tables and that causes 
>   synchronization across all CPUs)
> 
> anything else?
> 
> The slab cache with large order seems the best choice for this.

Sorry for being late, I just read this thread and tend to agree with
Mikulas, that this is a good use case for SL*B. If we extend the
use-case from "space-efficient allocator of objects smaller than page
size" to "space-efficient allocator of objects that are not power-of-two
pages" then IMHO it turns out the implementation would be almost the
same. All other variants listed above would lead to waste of memory or
fragmentation.

Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you
attending, or anyone else that can vouch for your usecase?

Mike Snitzer April 13, 2018, 3:10 p.m. UTC | #23
On Fri, Apr 13 2018 at  5:22am -0400,
Vlastimil Babka <vbabka@suse.cz> wrote:

> On 03/21/2018 07:36 PM, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 21 Mar 2018, Christopher Lameter wrote:
> > 
> >> On Wed, 21 Mar 2018, Mikulas Patocka wrote:
> >>
> >>>> You should not be using the slab allocators for these. Allocate higher
> >>>> order pages or numbers of consecutive smaller pages from the page
> >>>> allocator. The slab allocators are written for objects smaller than page
> >>>> size.
> >>>
> >>> So, do you argue that I need to write my own slab cache functionality
> >>> instead of using the existing slab code?
> >>
> >> Just use the existing page allocator calls to allocate and free the
> >> memory you need.
> >>
> >>> I can do it - but duplicating code is a bad thing.
> >>
> >> There is no need to duplicate anything. There is lots of infrastructure
> >> already in the kernel. You just need to use the right allocation / freeing
> >> calls.
> > 
> > So, what would you recommend for allocating 640KB objects while minimizing 
> > wasted space?
> > * alloc_pages - rounds up to the next power of two
> > * kmalloc - rounds up to the next power of two
> > * alloc_pages_exact - O(n*log n) complexity; and causes memory 
> >   fragmentation if used excessively
> > * vmalloc - horrible performance (modifies page tables and that causes 
> >   synchronization across all CPUs)
> > 
> > anything else?
> > 
> > The slab cache with large order seems the best choice for this.
> 
> Sorry for being late, I just read this thread and tend to agree with
> Mikulas, that this is a good use case for SL*B. If we extend the
> use-case from "space-efficient allocator of objects smaller than page
> size" to "space-efficient allocator of objects that are not power-of-two
> pages" then IMHO it turns out the implementation would be almost the
> same. All other variants listed above would lead to waste of memory or
> fragmentation.
> 
> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you
> attending, or anyone else that can vouch for your usecase?

Any further discussion on SLAB_MINIMIZE_WASTE should continue on list.

Mikulas won't be at LSF/MM.  But I included Mikulas' dm-bufio changes
that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of
the 4.17 merge window).

Mike

Vlastimil Babka April 16, 2018, 12:38 p.m. UTC | #24
On 04/13/2018 05:10 PM, Mike Snitzer wrote:
> On Fri, Apr 13 2018 at  5:22am -0400,
> Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you
>> attending, or anyone else that can vouch for your usecase?
> 
> Any further discussion on SLAB_MINIMIZE_WASTE should continue on list.
> 
> Mikulas won't be at LSF/MM.  But I included Mikulas' dm-bufio changes
> that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of
> the 4.17 merge window).

Can you or Mikulas briefly summarize how the dependency is avoided, and
whether, if (something like) SLAB_MINIMIZE_WASTE were implemented, the
dm-bufio code would happily switch to it, or not?

Thanks,
Vlastimil

> Mike
> 

Mike Snitzer April 16, 2018, 2:27 p.m. UTC | #25
On Mon, Apr 16 2018 at  8:38am -0400,
Vlastimil Babka <vbabka@suse.cz> wrote:

> On 04/13/2018 05:10 PM, Mike Snitzer wrote:
> > On Fri, Apr 13 2018 at  5:22am -0400,
> > Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you
> >> attending, or anyone else that can vouch for your usecase?
> > 
> > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list.
> > 
> > Mikulas won't be at LSF/MM.  But I included Mikulas' dm-bufio changes
> > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of
> > the 4.17 merge window).
> 
> Can you or Mikulas briefly summarize how the dependency is avoided, and
> whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the
> dm-bufio code would happily switch to it, or not?

git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c

But the most significant commit relative to SLAB_MINIMIZE_WASTE is:
359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for dm_buffer structure allocations")

So no, I don't see why dm-bufio would need to switch to
SLAB_MINIMIZE_WASTE if it were introduced in the future.

Mike

Mikulas Patocka April 16, 2018, 2:37 p.m. UTC | #26
On Mon, 16 Apr 2018, Mike Snitzer wrote:

> On Mon, Apr 16 2018 at  8:38am -0400,
> Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> > On 04/13/2018 05:10 PM, Mike Snitzer wrote:
> > > On Fri, Apr 13 2018 at  5:22am -0400,
> > > Vlastimil Babka <vbabka@suse.cz> wrote:
> > >>
> > >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you
> > >> attending, or anyone else that can vouch for your usecase?
> > > 
> > > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list.
> > > 
> > > Mikulas won't be at LSF/MM.  But I included Mikulas' dm-bufio changes
> > > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of
> > > the 4.17 merge window).
> > 
> > Can you or Mikulas briefly summarize how the dependency is avoided, and
> > whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the
> > dm-bufio code would happily switch to it, or not?
> 
> git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c
> 
> But the most significant commit relative to SLAB_MINIMIZE_WASTE is: 
> 359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for 
> dm_buffer structure allocations")
> 
> So no, I don't see why dm-bufio would need to switch to
> SLAB_MINIMIZE_WASTE if it were introduced in the future.

Currently, the slab cache rounds up the size of the slab to the next power 
of two (if the size is large). And that wastes memory that could otherwise 
be used for deduplication tables.

Generally, the performance of the deduplication solution depends on how 
much data you can put in memory. If you round a 640KB buffer up to 1MB (this 
is what the slab and slub subsystems currently do), you waste a lot of 
memory. Deduplication indices with 640KB blocks are already used in the 
wild, so it can't be easily changed.
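
As a quick illustration of the arithmetic (an editor's standalone C sketch 
assuming 4KB pages; this is not kernel code):

	#include <stdio.h>

	/* Waste per slab order for 640KB objects. Order 8 (1MB) is the
	 * power-of-two rounding; larger orders pack more objects. */
	int main(void)
	{
		const unsigned long obj = 640UL * 1024;
		int order;

		for (order = 8; order <= 11; order++) {
			unsigned long slab = 4096UL << order;
			unsigned long objs = slab / obj;
			unsigned long waste = slab - objs * obj;

			printf("order %d: %lukB slab, %lu objects, %.2f%% waste\n",
			       order, slab >> 10, objs, 100.0 * waste / slab);
		}
		return 0;
	}

This prints 37.50% waste for the order-8 (1MB) slab holding a single 
object, against 6.25% for an order-9 (2MB) slab holding three.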

> Mike

Mikulas

Mike Snitzer April 16, 2018, 2:46 p.m. UTC | #27
On Mon, Apr 16 2018 at 10:37am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Mon, 16 Apr 2018, Mike Snitzer wrote:
> 
> > On Mon, Apr 16 2018 at  8:38am -0400,
> > Vlastimil Babka <vbabka@suse.cz> wrote:
> > 
> > > On 04/13/2018 05:10 PM, Mike Snitzer wrote:
> > > > On Fri, Apr 13 2018 at  5:22am -0400,
> > > > Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >>
> > > >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you
> > > >> attending, or anyone else that can vouch for your usecase?
> > > > 
> > > > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list.
> > > > 
> > > > Mikulas won't be at LSF/MM.  But I included Mikulas' dm-bufio changes
> > > > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of
> > > > the 4.17 merge window).
> > > 
> > > Can you or Mikulas briefly summarize how the dependency is avoided, and
> > > whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the
> > > dm-bufio code would happily switch to it, or not?
> > 
> > git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c
> > 
> > But the most significant commit relative to SLAB_MINIMIZE_WASTE is: 
> > 359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for 
> > dm_buffer structure allocations")
> > 
> > So no, I don't see why dm-bufio would need to switch to
> > SLAB_MINIMIZE_WASTE if it were introduced in the future.
> 
> Currently, the slab cache rounds up the size of the slab to the next power 
> of two (if the size is large). And that wastes memory that could otherwise 
> be used for deduplication tables.

You mean at the overall cache level?  Or at the per-object
level?  I can only imagine you mean the former.
 
> Generally, the performance of the deduplication solution depends on how 
> much data you can put in memory. If you round a 640KB buffer up to 1MB (this 
> is what the slab and slub subsystems currently do), you waste a lot of 
> memory. Deduplication indices with 640KB blocks are already used in the 
> wild, so it can't be easily changed.

OK, seems you're suggesting a single object is rounded up.. so then this
header is very wrong?:

commit 359dbf19ab524652a2208a2a2cddccec2eede2ad
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Mon Mar 26 20:29:45 2018 +0200

    dm bufio: use slab cache for dm_buffer structure allocations

    kmalloc padded to the next power of two, using a slab cache avoids this.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

Please clarify further, thanks!
Mike

Mikulas Patocka April 16, 2018, 2:57 p.m. UTC | #28
On Mon, 16 Apr 2018, Mike Snitzer wrote:

> On Mon, Apr 16 2018 at 10:37am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Mon, 16 Apr 2018, Mike Snitzer wrote:
> > 
> > > On Mon, Apr 16 2018 at  8:38am -0400,
> > > Vlastimil Babka <vbabka@suse.cz> wrote:
> > > 
> > > > On 04/13/2018 05:10 PM, Mike Snitzer wrote:
> > > > > On Fri, Apr 13 2018 at  5:22am -0400,
> > > > > Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > >>
> > > > >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you
> > > > >> attending, or anyone else that can vouch for your usecase?
> > > > > 
> > > > > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list.
> > > > > 
> > > > > Mikulas won't be at LSF/MM.  But I included Mikulas' dm-bufio changes
> > > > > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of
> > > > > the 4.17 merge window).
> > > > 
> > > > Can you or Mikulas briefly summarize how the dependency is avoided, and
> > > > whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the
> > > > dm-bufio code would happily switch to it, or not?
> > > 
> > > git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c
> > > 
> > > But the most significant commit relative to SLAB_MINIMIZE_WASTE is: 
> > > 359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for 
> > > dm_buffer structure allocations")
> > > 
> > > So no, I don't see why dm-bufio would need to switch to
> > > SLAB_MINIMIZE_WASTE if it were introduced in the future.
> > 
> > Currently, the slab cache rounds up the size of the slab to the next power 
> > of two (if the size is large). And that wastes memory that could otherwise 
> > be used for deduplication tables.
> 
> You mean at the overall cache level?  Or at the per-object
> level?  I can only imagine you mean the former.

Unfortunately, it rounds up every object. So, if you have six 640KB 
objects, it consumes 6MB.
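
(For scale: six objects rounded up to 1MB each take 6MB, while two 
order-9 slabs of 2MB, each holding three 640KB objects, would fit the 
same six objects in 4MB.)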

> > Generally, the performance of the deduplication solution depends on how 
> > much data you can put in memory. If you round a 640KB buffer up to 1MB (this 
> > is what the slab and slub subsystems currently do), you waste a lot of 
> > memory. Deduplication indices with 640KB blocks are already used in the 
> > wild, so it can't be easily changed.
> 
> OK, seems you're suggesting a single object is rounded up.. so then this
> header is very wrong?:
> 
> commit 359dbf19ab524652a2208a2a2cddccec2eede2ad
> Author: Mikulas Patocka <mpatocka@redhat.com>
> Date:   Mon Mar 26 20:29:45 2018 +0200
> 
>     dm bufio: use slab cache for dm_buffer structure allocations
> 
>     kmalloc padded to the next power of two, using a slab cache avoids this.
> 
>     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> 
> Please clarify further, thanks!
> Mike

Yes, using a slab cache currently doesn't avoid this rounding (it needs the 
SLAB_MINIMIZE_WASTE patch to do that).

Mikulas

Christoph Lameter (Ampere) April 16, 2018, 3:18 p.m. UTC | #29
On Mon, 16 Apr 2018, Mikulas Patocka wrote:

> > Please clarify further, thanks!
> > Mike
>
> Yes, using a slab cache currently doesn't avoid this rounding (it needs the
> SLAB_MINIMIZE_WASTE patch to do that).

Or an increase in slab_max_order

Mikulas Patocka April 16, 2018, 3:25 p.m. UTC | #30
On Mon, 16 Apr 2018, Christopher Lameter wrote:

> On Mon, 16 Apr 2018, Mikulas Patocka wrote:
> 
> > > Please clarify further, thanks!
> > > Mike
> >
> > Yes, using a slab cache currently doesn't avoid this rounding (it needs the
> > SLAB_MINIMIZE_WASTE patch to do that).
> 
> Or an increase in slab_max_order

But that will increase it for all slabs (often senselessly - i.e. 
kmalloc-4096 would have order 4MB).

I need to increase it just for dm-bufio slabs.

Mikulas

Christoph Lameter (Ampere) April 16, 2018, 3:45 p.m. UTC | #31
On Mon, 16 Apr 2018, Mikulas Patocka wrote:

> >
> > Or an increase in slab_max_order
>
> But that will increase it for all slabs (often senselessly - i.e.
> kmalloc-4096 would have order 4MB).

4MB? Nope.... That is a power of two slab so no wasted space even with
order 0.

It's not a senseless increase. The more objects you fit into a slab page
the higher the performance of the allocator.


> I need to increase it just for dm-bufio slabs.

If you do this then others will want the same...

Mikulas Patocka April 16, 2018, 7:36 p.m. UTC | #32
On Mon, 16 Apr 2018, Christopher Lameter wrote:

> On Mon, 16 Apr 2018, Mikulas Patocka wrote:
> 
> > >
> > > Or an increase in slab_max_order
> >
> > But that will increase it for all slabs (often senselessly - i.e.
> > kmalloc-4096 would have order 4MB).
> 
> 4MB? Nope.... That is a power of two slab so no wasted space even with
> order 0.

See this email:
https://www.redhat.com/archives/dm-devel/2018-March/msg00387.html

If you boot with slub_max_order=10, the kmalloc-8192 cache has 64 pages. 
So yes, it increases the order of all slab caches (although not up to 
4MB).

> It's not a senseless increase. The more objects you fit into a slab page
> the higher the performance of the allocator.
> 
> 
> > I need to increase it just for dm-bufio slabs.
> 
> If you do this then others will want the same...

If others need it, they can turn on the flag SLAB_MINIMIZE_WASTE too.

Mikulas

Vlastimil Babka April 16, 2018, 7:38 p.m. UTC | #33
On 04/16/2018 04:37 PM, Mikulas Patocka wrote:
>>> Can you or Mikulas briefly summarize how the dependency is avoided, and
>>> whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the
>>> dm-bufio code would happily switch to it, or not?
>>
>> git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c
>>
>> But the most significant commit relative to SLAB_MINIMIZE_WASTE is: 
>> 359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for 
>> dm_buffer structure allocations")
>>
>> So no, I don't see why dm-bufio would need to switch to
>> SLAB_MINIMIZE_WASTE if it were introduced in the future.
> 
> Currently, the slab cache rounds up the size of the slab to the next power 
> of two (if the size is large). And that wastes memory that could otherwise 
> be used for deduplication tables.
> 
> Generally, the performance of the deduplication solution depends on how 
> much data you can put in memory. If you round a 640KB buffer up to 1MB (this 
> is what the slab and slub subsystems currently do), you waste a lot of 
> memory. Deduplication indices with 640KB blocks are already used in the 
> wild, so it can't be easily changed.

Thank you both for the clarification.

Vlastimil

> 
>> Mike
> 
> Mikulas
> 

Vlastimil Babka April 16, 2018, 7:53 p.m. UTC | #34
On 04/16/2018 09:36 PM, Mikulas Patocka wrote:
> 
> 
> On Mon, 16 Apr 2018, Christopher Lameter wrote:
> 
>> On Mon, 16 Apr 2018, Mikulas Patocka wrote:
>>
>>>>
>>>> Or an increase in slab_max_order
>>>
>>> But that will increase it for all slabs (often senselessly - i.e.
>>> kmalloc-4096 would have order 4MB).
>>
>> 4MB? Nope.... That is a power of two slab so no wasted space even with
>> order 0.
> 
> See this email:
> https://www.redhat.com/archives/dm-devel/2018-March/msg00387.html
> 
> If you boot with slub_max_order=10, the kmalloc-8192 cache has 64 pages. 
> So yes, it increases the order of all slab caches (although not up to 
> 4MB).
> 
>> It's not a senseless increase. The more objects you fit into a slab page
>> the higher the performance of the allocator.

It's not universally without a cost. It might increase internal
fragmentation of the slabs, if you end up with lots of 4MB pages
containing just a few objects. Thus, waste of memory. You also consume
high-order pages that could be used elsewhere. If you fail to allocate
4MB, then what's the fallback, order-0? I doubt it's "the highest
available order". Thus, a more conservative choice, e.g. order-3, might
more often succeed in allocating order-3, while a choice of 4MB will
have many order-0 fallbacks.

>>> I need to increase it just for dm-bufio slabs.
>>
>> If you do this then others will want the same...
> 
> If others need it, they can turn on the flag SLAB_MINIMIZE_WASTE too.

I think it should be possible without a new flag. The slub allocator
could just balance priorities (performance vs memory efficiency) better.
Currently I get the impression that "slub_max_order" is a performance
tunable. Let's add another criterion for selecting an order, one that
would try to pick an order keeping wasted space below e.g. 10% with some
different kind of max order. Pick good defaults, add tunables if you must.
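
A minimal user-space sketch of such a criterion (the 4KB page size, the 
10% threshold and the hard cap are assumptions of the sketch, not 
proposed kernel defaults):

	#include <stdio.h>

	#define PAGE_SZ   4096UL
	#define CAP_ORDER 10

	/* Pick the smallest order whose internal fragmentation stays at
	 * or below max_waste_pct, bounded by a hard cap. */
	static int pick_order(unsigned long size, unsigned long max_waste_pct)
	{
		int order;

		for (order = 0; order <= CAP_ORDER; order++) {
			unsigned long slab = PAGE_SZ << order;
			unsigned long objs = slab / size;

			if (!objs)
				continue;	/* object does not fit yet */
			if ((slab - objs * size) * 100 / slab <= max_waste_pct)
				return order;
		}
		return CAP_ORDER;
	}

	int main(void)
	{
		/* 640KB objects land on order 9 (2MB, 6.25% waste);
		 * power-of-two sizes land on order 0. */
		printf("%d %d\n", pick_order(640UL * 1024, 10),
		       pick_order(4096, 10));
		return 0;
	}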

I mean, anyone who's creating a cache for 640KB objects most likely
doesn't want to waste another 384KB for each such object. They shouldn't
have to add a flag to let the slub allocator figure out that using 2MB
pages is the right thing to do here.

Vlastimil

> Mikulas
> 

Mikulas Patocka April 16, 2018, 9:01 p.m. UTC | #35
On Mon, 16 Apr 2018, Vlastimil Babka wrote:

> On 04/16/2018 09:36 PM, Mikulas Patocka wrote:
> 
> >>> I need to increase it just for dm-bufio slabs.
> >>
> >> If you do this then others will want the same...
> > 
> > If others need it, they can turn on the flag SLAB_MINIMIZE_WASTE too.
> 
> I think it should be possible without a new flag. The slub allocator
> could just balance priorities (performance vs memory efficiency) better.
> Currently I get the impression that "slub_max_order" is a performance
> tunable. Let's add another criterion for selecting an order, one that
> would try to pick an order keeping wasted space below e.g. 10% with some
> different kind of max order. Pick good defaults, add tunables if you must.
> 
> I mean, anyone who's creating a cache for 640KB objects most likely
> doesn't want to waste another 384KB for each such object. They shouldn't
> have to add a flag to let the slub allocator figure out that using 2MB
> pages is the right thing to do here.
> 
> Vlastimil

The problem is that higher-order allocations (larger than 32K) are 
unreliable. So, if you increase page order beyond that, the allocation may 
randomly fail.

dm-bufio deals gracefully with allocation failure, because it preallocates 
some buffers with vmalloc, but other subsystems may not deal with it and 
they could return ENOMEM randomly or misbehave in other ways. So, the 
"SLAB_MINIMIZE_WASTE" flag is also saying that the allocation may fail and 
the caller is prepared to deal with it.
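
An opt-in caller might then look like this sketch (reuse_existing_buffer() 
is a hypothetical stand-in for dm-bufio's preallocated-buffer fallback):

	/* Sketch only: a cache for 640KB buffers using the proposed flag;
	 * the allocation may fail and the caller falls back gracefully. */
	struct kmem_cache *cache = kmem_cache_create("dedup_buffer",
				640 * 1024, 0, SLAB_MINIMIZE_WASTE, NULL);
	void *buf = kmem_cache_alloc(cache, GFP_NOIO);

	if (!buf)
		buf = reuse_existing_buffer();	/* hypothetical fallback */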

The slub subsystem does fall back to a lower order when the allocation 
fails (it allows a different order for each slab in the same cache), but 
slab doesn't fall back and you get NULL if a higher-order allocation fails. 
So, SLAB_MINIMIZE_WASTE is needed for slab because it will just randomly 
fail at higher orders.
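
The shape of that slub fallback, heavily simplified from allocate_slab() 
in mm/slub.c:

	/* Try the preferred (possibly high) order without retries or
	 * warnings first, then fall back to the cache's minimum order. */
	page = alloc_pages(alloc_gfp | __GFP_NOWARN | __GFP_NORETRY,
			   oo_order(s->oo));
	if (!page)
		page = alloc_pages(alloc_gfp, oo_order(s->min));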

Mikulas

Mikulas Patocka April 16, 2018, 9:04 p.m. UTC | #36
On Mon, 16 Apr 2018, Vlastimil Babka wrote:

> On 04/13/2018 05:10 PM, Mike Snitzer wrote:
> > On Fri, Apr 13 2018 at  5:22am -0400,
> > Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you
> >> attending, or anyone else that can vouch for your usecase?
> > 
> > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list.
> > 
> > Mikulas won't be at LSF/MM.  But I included Mikulas' dm-bufio changes
> > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of
> > the 4.17 merge window).
> 
> Can you or Mikulas briefly summarize how the dependency is avoided, and

This was Mike's misconception - dm-bufio actually needs the 
SLAB_MINIMIZE_WASTE patch, otherwise it is wasting memory.

> whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the
> dm-bufio code would happily switch to it, or not?
> 
> Thanks,
> Vlastimil

Mikulas

Christoph Lameter (Ampere) April 17, 2018, 2:40 p.m. UTC | #37
On Mon, 16 Apr 2018, Mikulas Patocka wrote:

> dm-bufio deals gracefully with allocation failure, because it preallocates
> some buffers with vmalloc, but other subsystems may not deal with it and
> they could return ENOMEM randomly or misbehave in other ways. So, the
> "SLAB_MINIMIZE_WASTE" flag is also saying that the allocation may fail and
> the caller is prepared to deal with it.
>
> The slub subsystem does fall back to a lower order when the allocation
> fails (it allows a different order for each slab in the same cache), but
> slab doesn't fall back and you get NULL if a higher-order allocation fails.
> So, SLAB_MINIMIZE_WASTE is needed for slab because it will just randomly
> fail at higher orders.

Fix Slab instead of adding a flag that is only useful for one allocator?


Christoph Lameter (Ampere) April 17, 2018, 2:47 p.m. UTC | #38
On Mon, 16 Apr 2018, Mikulas Patocka wrote:

> If you boot with slub_max_order=10, the kmalloc-8192 cache has 64 pages.
> So yes, it increases the order of all slab caches (although not up to
> 4MB).

Hmmm... Ok. There is another setting slub_min_objects that controls how
many objects to fit into a slab page.

We could change the allocation scheme so that it finds the minimum order
with the minimum waste. Allocator performance will drop, though, since fewer
objects are then in a slab page.



Christoph Lameter (Ampere) April 17, 2018, 2:49 p.m. UTC | #39
On Mon, 16 Apr 2018, Vlastimil Babka wrote:

> >> It's not a senseless increase. The more objects you fit into a slab page
> >> the higher the performance of the allocator.
>
> It's not universally without a cost. It might increase internal
> fragmentation of the slabs, if you end up with lots of 4MB pages
> containing just a few objects. Thus, waste of memory. You also consume
> high-order pages that could be used elsewhere. If you fail to allocate
> 4MB, then what's the fallback, order-0? I doubt it's "the highest
> available order". Thus, a more conservative choice, e.g. order-3, might
> more often succeed in allocating order-3, while a choice of 4MB will
> have many order-0 fallbacks.

Obviously there is a cost. But here the subsystem has a fallback.

> I think it should be possible without a new flag. The slub allocator
> could just balance priorities (performance vs memory efficiency) better.
> Currently I get the impression that "slub_max_order" is a performance
> tunable. Let's add another criterion for selecting an order, one that
> would try to pick an order keeping wasted space below e.g. 10% with some
> different kind of max order. Pick good defaults, add tunables if you must.

There is also slub_min_objects.

> I mean, anyone who's creating a cache for 640KB objects most likely
> doesn't want to waste another 384KB for each such object. They shouldn't
> have to add a flag to let the slub allocator figure out that using 2MB
> pages is the right thing to do here.

I agree; if we do this, then preferably without a flag.

Mikulas Patocka April 17, 2018, 6:53 p.m. UTC | #40
On Tue, 17 Apr 2018, Christopher Lameter wrote:

> On Mon, 16 Apr 2018, Mikulas Patocka wrote:
> 
> > dm-bufio deals gracefully with allocation failure, because it preallocates
> > some buffers with vmalloc, but other subsystems may not deal with it and
> > they could return ENOMEM randomly or misbehave in other ways. So, the
> > "SLAB_MINIMIZE_WASTE" flag is also saying that the allocation may fail and
> > the caller is prepared to deal with it.
> >
> > The slub subsystem does fall back to a lower order when the allocation
> > fails (it allows a different order for each slab in the same cache), but
> > slab doesn't fall back and you get NULL if a higher-order allocation fails.
> > So, SLAB_MINIMIZE_WASTE is needed for slab because it will just randomly
> > fail at higher orders.
> 
> Fix Slab instead of adding a flag that is only useful for one allocator?

Slab assumes that all slabs have the same order, so it's not so easy to 
fix it.

Mikulas

Christoph Lameter (Ampere) April 17, 2018, 9:42 p.m. UTC | #41
On Tue, 17 Apr 2018, Mikulas Patocka wrote:

> > > The slub subsystem does fall back to a lower order when the allocation
> > > fails (it allows a different order for each slab in the same cache), but
> > > slab doesn't fall back and you get NULL if a higher-order allocation fails.
> > > So, SLAB_MINIMIZE_WASTE is needed for slab because it will just randomly
> > > fail at higher orders.
> >
> > Fix Slab instead of adding a flag that is only useful for one allocator?
>
> Slab assumes that all slabs have the same order, so it's not so easy to
> fix it.

Well, since SLAB uses compound pages, one could easily determine the order
of the page and thus also support multiple page orders there.
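
Something like this sketch, using the existing compound_order() helper 
(not actual kernel code):

	/* Derive each slab's order from its compound head page instead of
	 * assuming a single cachep->gfporder for the whole cache. */
	static unsigned int slab_page_order(struct page *page)
	{
		return compound_order(page);
	}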

diff mbox

Patch

Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/include/linux/slab.h	2018-03-20 14:59:20.518030000 +0100
@@ -108,6 +108,13 @@ 
 #define SLAB_KASAN		0
 #endif
 
+/*
+ * Use higher order allocations to minimize wasted space.
+ * Note: the allocation is unreliable if this flag is used, the caller
+ * must handle allocation failures gracefully.
+ */
+#define SLAB_MINIMIZE_WASTE	((slab_flags_t __force)0x10000000U)
+
 /* The following flags affect the page allocator grouping pages by mobility */
 /* Objects are reclaimable */
 #define SLAB_RECLAIM_ACCOUNT	((slab_flags_t __force)0x00020000U)
Index: linux-2.6/mm/slab_common.c
===================================================================
--- linux-2.6.orig/mm/slab_common.c	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/mm/slab_common.c	2018-03-20 14:59:20.518030000 +0100
@@ -52,7 +52,7 @@  static DECLARE_WORK(slab_caches_to_rcu_d
 		SLAB_FAILSLAB | SLAB_KASAN)
 
 #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
-			 SLAB_ACCOUNT)
+			 SLAB_ACCOUNT | SLAB_MINIMIZE_WASTE)
 
 /*
  * Merge control. If this is set then no merging of slab caches will occur.
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/mm/slub.c	2018-03-20 14:59:20.518030000 +0100
@@ -3234,7 +3234,7 @@  static inline int slab_order(int size, i
 	return order;
 }
 
-static inline int calculate_order(int size, int reserved)
+static inline int calculate_order(int size, int reserved, slab_flags_t flags)
 {
 	int order;
 	int min_objects;
@@ -3261,7 +3261,7 @@  static inline int calculate_order(int si
 			order = slab_order(size, min_objects,
 					slub_max_order, fraction, reserved);
 			if (order <= slub_max_order)
-				return order;
+				goto ret_order;
 			fraction /= 2;
 		}
 		min_objects--;
@@ -3273,15 +3273,30 @@  static inline int calculate_order(int si
 	 */
 	order = slab_order(size, 1, slub_max_order, 1, reserved);
 	if (order <= slub_max_order)
-		return order;
+		goto ret_order;
 
 	/*
 	 * Doh this slab cannot be placed using slub_max_order.
 	 */
 	order = slab_order(size, 1, MAX_ORDER, 1, reserved);
 	if (order < MAX_ORDER)
-		return order;
+		goto ret_order;
 	return -ENOSYS;
+
+ret_order:
+	if (flags & SLAB_MINIMIZE_WASTE) {
+		/* Increase the order if it decreases waste */
+		int test_order;
+		for (test_order = order + 1; test_order < MAX_ORDER; test_order++) {
+			unsigned long order_objects = ((PAGE_SIZE << order) - reserved) / size;
+			unsigned long test_order_objects = ((PAGE_SIZE << test_order) - reserved) / size;
+			if (test_order_objects >= min(64, MAX_OBJS_PER_PAGE))
+				break;
+			if (test_order_objects > order_objects << (test_order - order))
+				order = test_order;
+		}
+	}
+	return order;
 }
 
 static void
@@ -3546,7 +3561,7 @@  static int calculate_sizes(struct kmem_c
 	if (forced_order >= 0)
 		order = forced_order;
 	else
-		order = calculate_order(size, s->reserved);
+		order = calculate_order(size, s->reserved, flags);
 
 	if (order < 0)
 		return 0;
Index: linux-2.6/mm/slab.h
===================================================================
--- linux-2.6.orig/mm/slab.h	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/mm/slab.h	2018-03-20 14:59:20.518030000 +0100
@@ -142,10 +142,10 @@  static inline slab_flags_t kmem_cache_fl
 #if defined(CONFIG_SLAB)
 #define SLAB_CACHE_FLAGS (SLAB_MEM_SPREAD | SLAB_NOLEAKTRACE | \
 			  SLAB_RECLAIM_ACCOUNT | SLAB_TEMPORARY | \
-			  SLAB_ACCOUNT)
+			  SLAB_ACCOUNT | SLAB_MINIMIZE_WASTE)
 #elif defined(CONFIG_SLUB)
 #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
-			  SLAB_TEMPORARY | SLAB_ACCOUNT)
+			  SLAB_TEMPORARY | SLAB_ACCOUNT | SLAB_MINIMIZE_WASTE)
 #else
 #define SLAB_CACHE_FLAGS (0)
 #endif
@@ -164,7 +164,8 @@  static inline slab_flags_t kmem_cache_fl
 			      SLAB_NOLEAKTRACE | \
 			      SLAB_RECLAIM_ACCOUNT | \
 			      SLAB_TEMPORARY | \
-			      SLAB_ACCOUNT)
+			      SLAB_ACCOUNT | \
+			      SLAB_MINIMIZE_WASTE)
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/mm/slab.c	2018-03-20 14:59:20.518030000 +0100
@@ -1789,14 +1789,14 @@  static size_t calculate_slab_order(struc
 		 * as GFP_NOFS and we really don't want to have to be allocating
 		 * higher-order pages when we are unable to shrink dcache.
 		 */
-		if (flags & SLAB_RECLAIM_ACCOUNT)
+		if (flags & SLAB_RECLAIM_ACCOUNT && !(flags & SLAB_MINIMIZE_WASTE))
 			break;
 
 		/*
 		 * Large number of objects is good, but very large slabs are
 		 * currently bad for the gfp()s.
 		 */
-		if (gfporder >= slab_max_order)
+		if (gfporder >= slab_max_order && !(flags & SLAB_MINIMIZE_WASTE))
 			break;
 
 		/*