Message ID:   alpine.LRH.2.02.1803200954590.18995@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State:        Superseded, archived
Delegated to: Mike Snitzer
On Tue, Mar 20, 2018 at 01:25:09PM -0400, Mikulas Patocka wrote:
> The reason why we need this is that we are going to merge code that does block device deduplication (it was developed separately and sold as a commercial product), and the code uses block sizes that are not a power of two (block sizes 192K, 448K, 640K, 832K are used in the wild). The slab allocator rounds up the allocation to the nearest power of two, but that wastes a lot of memory. Performance of the solution depends on efficient memory usage, so we should minimize waste as much as possible.

The SLUB allocator also falls back to using the page (buddy) allocator for allocations above 8kB, so this patch is going to have no effect on slub. You'd be better off using alloc_pages_exact() for this kind of size, or managing your own pool of pages by using something like five 192k blocks in a 1MB allocation.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
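[Editor's note: the waste caused by power-of-two rounding for the block sizes quoted above is easy to quantify. A minimal userspace sketch of the arithmetic - not kernel code - follows; `round_up_pow2` models what kmalloc and the buddy allocator do to a large request.]

```c
#include <stddef.h>

/* Userspace model of the rounding being discussed: kmalloc() and the
 * buddy allocator both round a large request up to the next power of
 * two, so a non-power-of-two block size wastes the difference. */
static size_t round_up_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

static size_t waste_per_object(size_t block)
{
	return round_up_pow2(block) - block;
}
```

For the 640K case this gives 384K wasted per object, the figure discussed later in the thread; for 832K it is 192K per object.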
On Tue, 20 Mar 2018, Matthew Wilcox wrote:

> On Tue, Mar 20, 2018 at 01:25:09PM -0400, Mikulas Patocka wrote:
> > The reason why we need this is that we are going to merge code that does block device deduplication (it was developed separately and sold as a commercial product), and the code uses block sizes that are not a power of two (block sizes 192K, 448K, 640K, 832K are used in the wild). The slab allocator rounds up the allocation to the nearest power of two, but that wastes a lot of memory. Performance of the solution depends on efficient memory usage, so we should minimize waste as much as possible.
>
> The SLUB allocator also falls back to using the page (buddy) allocator for allocations above 8kB, so this patch is going to have no effect on slub. You'd be better off using alloc_pages_exact() for this kind of size, or managing your own pool of pages by using something like five 192k blocks in a 1MB allocation.

The fallback is only effective for kmalloc caches. Manually created caches do not follow this rule.

Note that you can already control the page orders for allocation and the objects per slab using

	slub_min_order
	slub_max_order
	slub_min_objects

This is documented in linux/Documentation/vm/slub.txt

Maybe do the same thing for SLAB?
On Tue, 20 Mar 2018, Christopher Lameter wrote:

> On Tue, 20 Mar 2018, Matthew Wilcox wrote:
>
> > On Tue, Mar 20, 2018 at 01:25:09PM -0400, Mikulas Patocka wrote:
> > > The reason why we need this is that we are going to merge code that does block device deduplication (it was developed separately and sold as a commercial product), and the code uses block sizes that are not a power of two (block sizes 192K, 448K, 640K, 832K are used in the wild). The slab allocator rounds up the allocation to the nearest power of two, but that wastes a lot of memory. Performance of the solution depends on efficient memory usage, so we should minimize waste as much as possible.
> >
> > The SLUB allocator also falls back to using the page (buddy) allocator for allocations above 8kB, so this patch is going to have no effect on slub. You'd be better off using alloc_pages_exact() for this kind of size, or managing your own pool of pages by using something like five 192k blocks in a 1MB allocation.
>
> The fallback is only effective for kmalloc caches. Manually created caches do not follow this rule.

Yes - the dm-bufio layer uses manually created caches.

> Note that you can already control the page orders for allocation and the objects per slab using
>
> 	slub_min_order
> 	slub_max_order
> 	slub_min_objects
>
> This is documented in linux/Documentation/vm/slub.txt
>
> Maybe do the same thing for SLAB?

Yes, but I need to change it for a specific cache, not for all caches. When the order is greater than 3 (PAGE_ALLOC_COSTLY_ORDER), the allocation becomes unreliable, thus it is a bad idea to increase slub_max_order system-wide.

Another problem with slub_max_order is that it would pad all caches to slub_max_order, even those that already have a power-of-two size (in that case, the padding is counterproductive).

BTW, the function "order_store" in mm/slub.c modifies the structure kmem_cache without taking any locks - is that a bug?

Mikulas
On Tue, 20 Mar 2018, Mikulas Patocka wrote:

> > Maybe do the same thing for SLAB?
>
> Yes, but I need to change it for a specific cache, not for all caches.

Why only some caches?

> When the order is greater than 3 (PAGE_ALLOC_COSTLY_ORDER), the allocation becomes unreliable, thus it is a bad idea to increase slub_max_order system-wide.

Well, the allocation is more likely to fail, that is true, but SLUB will fall back to a smaller order should the page allocator refuse to give us that larger sized page.

> Another problem with slub_max_order is that it would pad all caches to slub_max_order, even those that already have a power-of-two size (in that case, the padding is counterproductive).

No it does not. SLUB will calculate the configuration with the least byte wastage. It is not the standard order but the maximum order to be used. Power-of-two caches below PAGE_SIZE will have order 0.

There are some corner cases where extra metadata is needed per object or per page that will result in either object sizes that are no longer a power of two or in page sizes smaller than the whole page. Maybe you have a case like that? Can you show me a cache that has this issue?

> BTW, the function "order_store" in mm/slub.c modifies the structure kmem_cache without taking any locks - is that a bug?

The kmem_cache structure was just allocated. Only one thread can access it, thus no locking is necessary.
On Tue, 20 Mar 2018, Christopher Lameter wrote:

> On Tue, 20 Mar 2018, Mikulas Patocka wrote:
>
> > > Maybe do the same thing for SLAB?
> >
> > Yes, but I need to change it for a specific cache, not for all caches.
>
> Why only some caches?

I need a high order for the buffer cache that holds the deduplicated data. I don't need to force it system-wide.

> > When the order is greater than 3 (PAGE_ALLOC_COSTLY_ORDER), the allocation becomes unreliable, thus it is a bad idea to increase slub_max_order system-wide.
>
> Well, the allocation is more likely to fail, that is true, but SLUB will fall back to a smaller order should the page allocator refuse to give us that larger sized page.

Does SLAB have this fall-back too?

> > Another problem with slub_max_order is that it would pad all caches to slub_max_order, even those that already have a power-of-two size (in that case, the padding is counterproductive).
>
> No it does not. SLUB will calculate the configuration with the least byte wastage. It is not the standard order but the maximum order to be used. Power-of-two caches below PAGE_SIZE will have order 0.

Try to boot with slub_max_order=10 and you can see this in /proc/slabinfo:

	kmalloc-8192   352   352   8192   32   64 : tunables 0 0 0 : slabdata 11 11 0

(note the 64 pages per slab)

So it rounds up power-of-two sizes to high orders unnecessarily. Without slub_max_order=10, the number of pages for the kmalloc-8192 cache is just 8. I observe the same pathological rounding in dm-bufio caches.

> There are some corner cases where extra metadata is needed per object or per page that will result in either object sizes that are no longer a power of two or in page sizes smaller than the whole page. Maybe you have a case like that? Can you show me a cache that has this issue?

Here I have a patch set that changes the dm-bufio subsystem to support buffer sizes that are not a power of two:
http://people.redhat.com/~mpatocka/patches/kernel/dm-bufio-arbitrary-sector-size/

I need to change the slub cache to minimize wasted space - i.e. when asking for a slab cache for 640kB objects, the slub system currently allocates 1MB per object and 384kB is wasted. This is the reason why I'm making this patch.

> > BTW, the function "order_store" in mm/slub.c modifies the structure kmem_cache without taking any locks - is that a bug?
>
> The kmem_cache structure was just allocated. Only one thread can access it, thus no locking is necessary.

No - order_store is called when writing to /sys/kernel/slab/<cache>/order - you can modify the order of any existing cache - and the modification happens without any locking.

Mikulas
On Tue, 20 Mar 2018, Mikulas Patocka wrote:

> > > Another problem with slub_max_order is that it would pad all caches to slub_max_order, even those that already have a power-of-two size (in that case, the padding is counterproductive).
> >
> > No it does not. SLUB will calculate the configuration with the least byte wastage. It is not the standard order but the maximum order to be used. Power-of-two caches below PAGE_SIZE will have order 0.
>
> Try to boot with slub_max_order=10 and you can see this in /proc/slabinfo:
> 	kmalloc-8192   352   352   8192   32   64 : tunables 0 0 0 : slabdata 11 11 0

Yes, it tries to create a slab size that will accommodate the minimum objects per slab.

> So it rounds up power-of-two sizes to high orders unnecessarily. Without slub_max_order=10, the number of pages for the kmalloc-8192 cache is just 8.

The kmalloc-8192 cache has 4 objects per slab on my system, which means an allocation size of 32k = order 3.

In this case 4 objects fit tightly into a slab. There is no waste.

But then I thought you were talking about manually created slabs, not about the kmalloc array?

> I observe the same pathological rounding in dm-bufio caches.
>
> > There are some corner cases where extra metadata is needed per object or per page that will result in either object sizes that are no longer a power of two or in page sizes smaller than the whole page. Maybe you have a case like that? Can you show me a cache that has this issue?
>
> Here I have a patch set that changes the dm-bufio subsystem to support buffer sizes that are not a power of two:
> http://people.redhat.com/~mpatocka/patches/kernel/dm-bufio-arbitrary-sector-size/
>
> I need to change the slub cache to minimize wasted space - i.e. when asking for a slab cache for 640kB objects, the slub system currently allocates 1MB per object and 384kB is wasted. This is the reason why I'm making this patch.

You should not be using the slab allocators for these. Allocate higher order pages or numbers of consecutive smaller pages from the page allocator. The slab allocators are written for objects smaller than page size.

> > > BTW, the function "order_store" in mm/slub.c modifies the structure kmem_cache without taking any locks - is that a bug?
> >
> > The kmem_cache structure was just allocated. Only one thread can access it, thus no locking is necessary.
>
> No - order_store is called when writing to /sys/kernel/slab/<cache>/order - you can modify the order of any existing cache - and the modification happens without any locking.

Well, it still does not matter. The order of slab pages can be dynamic even within a slab cache; you can have pages of varying sizes.

What kind of problem could be caused here?
On Wed, 21 Mar 2018, Christopher Lameter wrote:

> On Tue, 20 Mar 2018, Mikulas Patocka wrote:
>
> > > > Another problem with slub_max_order is that it would pad all caches to slub_max_order, even those that already have a power-of-two size (in that case, the padding is counterproductive).
> > >
> > > No it does not. SLUB will calculate the configuration with the least byte wastage. It is not the standard order but the maximum order to be used. Power-of-two caches below PAGE_SIZE will have order 0.
> >
> > Try to boot with slub_max_order=10 and you can see this in /proc/slabinfo:
> > 	kmalloc-8192   352   352   8192   32   64 : tunables 0 0 0 : slabdata 11 11 0
>
> Yes, it tries to create a slab size that will accommodate the minimum objects per slab.
>
> > So it rounds up power-of-two sizes to high orders unnecessarily. Without slub_max_order=10, the number of pages for the kmalloc-8192 cache is just 8.
>
> The kmalloc-8192 cache has 4 objects per slab on my system, which means an allocation size of 32k.
>
> In this case 4 objects fit tightly into a slab. There is no waste.
>
> But then I thought you were talking about manually created slabs, not about the kmalloc array?

For some workloads, dm-bufio needs caches with sizes that are a power of two (the majority of workloads fall into this category). For other workloads, dm-bufio needs caches with sizes that are not a power of two.

Now - we don't want higher-order allocations for power-of-two caches (because higher-order allocations just cause memory fragmentation without any benefit), but we want higher-order allocations for non-power-of-two caches (because higher-order allocations minimize wasted space).

For example:
- for a 192K block size, the ideal allocation size is 4MB (it takes 21 blocks)
- for a 448K block size, the ideal allocation size is 4MB (it takes 9 blocks)
- for a 512K block size, the ideal allocation size is 512KB (there is no benefit from using a higher order)
- for a 640K block size, the ideal allocation size is 2MB (it takes 3 blocks; increasing the allocation size to 4MB doesn't result in any benefit)
- for an 832K block size, the ideal allocation size is 1MB (it takes 1 block; increasing the allocation to 2MB or 4MB doesn't result in any benefit)
- for a 1M block size, the ideal allocation size is 1MB

The problem with "slub_max_order" is that it increases the order either always or never; it doesn't have the capability to calculate the ideal order for a given object size. The patch that I sent does just this calculation.

Another problem with "slub_max_order" is that a device driver that needs to create a slab cache cannot really set it - the device driver can't modify kernel parameters.

> > I observe the same pathological rounding in dm-bufio caches.
> >
> > > There are some corner cases where extra metadata is needed per object or per page that will result in either object sizes that are no longer a power of two or in page sizes smaller than the whole page. Maybe you have a case like that? Can you show me a cache that has this issue?
> >
> > Here I have a patch set that changes the dm-bufio subsystem to support buffer sizes that are not a power of two:
> > http://people.redhat.com/~mpatocka/patches/kernel/dm-bufio-arbitrary-sector-size/
> >
> > I need to change the slub cache to minimize wasted space - i.e. when asking for a slab cache for 640kB objects, the slub system currently allocates 1MB per object and 384kB is wasted. This is the reason why I'm making this patch.
>
> You should not be using the slab allocators for these. Allocate higher order pages or numbers of consecutive smaller pages from the page allocator. The slab allocators are written for objects smaller than page size.

So, do you argue that I need to write my own slab cache functionality instead of using the existing slab code?

I can do it - but duplicating code is a bad thing.

> > > > BTW, the function "order_store" in mm/slub.c modifies the structure kmem_cache without taking any locks - is that a bug?
> > >
> > > The kmem_cache structure was just allocated. Only one thread can access it, thus no locking is necessary.
> >
> > No - order_store is called when writing to /sys/kernel/slab/<cache>/order - you can modify the order of any existing cache - and the modification happens without any locking.
>
> Well, it still does not matter. The order of slab pages can be dynamic even within a slab cache; you can have pages of varying sizes.
>
> What kind of problem could be caused here?

Unlocked accesses are generally considered bad. For example, see this piece of code in calculate_sizes:

	s->allocflags = 0;
	if (order)
		s->allocflags |= __GFP_COMP;

	if (s->flags & SLAB_CACHE_DMA)
		s->allocflags |= GFP_DMA;

	if (s->flags & SLAB_RECLAIM_ACCOUNT)
		s->allocflags |= __GFP_RECLAIMABLE;

If you run this while the cache is in use (i.e. when the user writes to /sys/kernel/slab/<cache>/order), then other processes will see invalid s->allocflags for a short time.

Mikulas
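[Editor's note: the ideal allocation sizes listed in the message above can be cross-checked with a short userspace calculation - a sketch of waste-minimizing order selection under an assumed 4MB cap, not the actual patch. For each block size it picks the smallest power-of-two allocation that minimizes wasted bytes per object.]

```c
#include <stddef.h>

#define KB(x) ((size_t)(x) << 10)
#define MB(x) ((size_t)(x) << 20)

/* For a given block size, find the smallest power-of-two allocation
 * size (up to a 4MB cap) that minimizes wasted bytes per object. */
static size_t ideal_alloc_size(size_t block)
{
	size_t best = 0, best_waste = 0, best_objs = 1;
	size_t a;

	for (a = 1; a < block; a <<= 1)
		;			/* smallest power of two >= block */
	for (; a <= MB(4); a <<= 1) {
		size_t objs = a / block;
		size_t waste = a - objs * block;

		/* is waste/objs < best_waste/best_objs? (cross-multiply) */
		if (!best || waste * best_objs < best_waste * objs) {
			best = a;
			best_waste = waste;
			best_objs = objs;
		}
	}
	return best;
}
```

Running this for the sizes above reproduces the table: 192K and 448K blocks want 4MB allocations, 640K wants 2MB, 832K wants 1MB, and the power-of-two sizes stay at their own size.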
On Wed, Mar 21, 2018 at 12:25:39PM -0400, Mikulas Patocka wrote:
> Now - we don't want higher-order allocations for power-of-two caches (because higher-order allocations just cause memory fragmentation without any benefit)

Higher-order allocations don't cause memory fragmentation. Indeed, they avoid it. They do fail as a result of fragmentation, which is probably what you meant.

> , but we want higher-order allocations for non-power-of-two caches (because higher-order allocations minimize wasted space).
>
> For example:
> - for a 192K block size, the ideal allocation size is 4MB (it takes 21 blocks)

I wonder if that's true. You can get five blocks into 1MB, wasting 64kB. So going up by two orders lets you get an extra block in at the cost of failing more frequently.

> > You should not be using the slab allocators for these. Allocate higher order pages or numbers of consecutive smaller pages from the page allocator. The slab allocators are written for objects smaller than page size.
>
> So, do you argue that I need to write my own slab cache functionality instead of using the existing slab code?
>
> I can do it - but duplicating code is a bad thing.

It is - but writing a special-purpose allocator can be better than making a general-purpose allocator also solve a special purpose. I don't know whether that's true here or not.

Your allocator seems like it could be remarkably simple; you know you're always doing high-order allocations, and you know that you're never allocating more than a handful of blocks from a page allocation. So you can probably store all of your metadata in the struct page (because your metadata is basically a bitmap) and significantly save on memory usage. The one downside I see is that you don't get the reporting through /proc/slabinfo.

So, is this an area where slub should be improved, or is this a case where writing a special-purpose allocator makes more sense?

It seems like you already have a special-purpose allocator, in that you know how to fall back to vmalloc if slab-alloc fails. So maybe have your own allocator that interfaces to the page allocator for now; keep its interface nice and clean, and maybe it'll get pulled out of your driver and put into mm/ some day if it becomes a useful API for everybody to share?
On Wed, 21 Mar 2018, Mikulas Patocka wrote:

> > You should not be using the slab allocators for these. Allocate higher order pages or numbers of consecutive smaller pages from the page allocator. The slab allocators are written for objects smaller than page size.
>
> So, do you argue that I need to write my own slab cache functionality instead of using the existing slab code?

Just use the existing page allocator calls to allocate and free the memory you need.

> I can do it - but duplicating code is a bad thing.

There is no need to duplicate anything. There is lots of infrastructure already in the kernel. You just need to use the right allocation / freeing calls.

> > What kind of problem could be caused here?
>
> Unlocked accesses are generally considered bad. For example, see this piece of code in calculate_sizes:
>
> 	s->allocflags = 0;
> 	if (order)
> 		s->allocflags |= __GFP_COMP;
>
> 	if (s->flags & SLAB_CACHE_DMA)
> 		s->allocflags |= GFP_DMA;
>
> 	if (s->flags & SLAB_RECLAIM_ACCOUNT)
> 		s->allocflags |= __GFP_RECLAIMABLE;
>
> If you run this while the cache is in use (i.e. when the user writes to /sys/kernel/slab/<cache>/order), then other processes will see invalid s->allocflags for a short time.

Calculating sizes is done when the slab has only a single accessor. Thus no locking is needed.

Changing the size of objects in a slab cache when there is already a set of objects allocated and under management by the slab cache would cause the allocator to fail and lead to garbled data.
One other thought: if you want to improve the behavior for large objects allocated through kmalloc/kmem_cache, then we would certainly be glad to entertain those ideas.

F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not allocate powers of two pages. It would be relatively easy to make kmalloc_large round the allocation to the next page size and then allocate N consecutive pages via alloc_pages_exact() and free the remainder unused pages, or some such thing.
On Wed, Mar 21, 2018 at 12:39:33PM -0500, Christopher Lameter wrote:
> One other thought: if you want to improve the behavior for large objects allocated through kmalloc/kmem_cache, then we would certainly be glad to entertain those ideas.
>
> F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not allocate powers of two pages. It would be relatively easy to make kmalloc_large round the allocation to the next page size and then allocate N consecutive pages via alloc_pages_exact() and free the remainder unused pages, or some such thing.

I don't know if that's a good idea. That will contribute to fragmentation if the allocation is held onto for a short-to-medium length of time. If the allocation is for a very long period of time, then those pages would have been unavailable anyway, but if the user of the tail pages holds them beyond the lifetime of the large allocation, then this is probably a bad tradeoff to make.

I do see Mikulas' use case as interesting, I just don't know whether it's worth changing slab/slub to support it. At first blush, other than the sheer size of the allocations, it's a good fit.
On Wed, 21 Mar 2018, Matthew Wilcox wrote:

> I don't know if that's a good idea. That will contribute to fragmentation if the allocation is held onto for a short-to-medium length of time. If the allocation is for a very long period of time, then those pages would have been unavailable anyway, but if the user of the tail pages holds them beyond the lifetime of the large allocation, then this is probably a bad tradeoff to make.
>
> I do see Mikulas' use case as interesting, I just don't know whether it's worth changing slab/slub to support it. At first blush, other than the sheer size of the allocations, it's a good fit.

Well, there are numerous page pool approaches already in the kernel. Maybe there is a better fit with those?
On Wed, 21 Mar 2018, Matthew Wilcox wrote:

> On Wed, Mar 21, 2018 at 12:39:33PM -0500, Christopher Lameter wrote:
> > One other thought: if you want to improve the behavior for large objects allocated through kmalloc/kmem_cache, then we would certainly be glad to entertain those ideas.
> >
> > F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not allocate powers of two pages. It would be relatively easy to make kmalloc_large round the allocation to the next page size and then allocate N consecutive pages via alloc_pages_exact() and free the remainder unused pages, or some such thing.

alloc_pages_exact() has O(n*log n) complexity with respect to the number of requested pages. It would have to be reworked and optimized if it were to be used for the dm-bufio cache. (It could be optimized down to O(log n) if it didn't split the compound page into a lot of separate pages, but split it into power-of-two clusters instead.)

> I don't know if that's a good idea. That will contribute to fragmentation if the allocation is held onto for a short-to-medium length of time. If the allocation is for a very long period of time, then those pages would have been unavailable anyway, but if the user of the tail pages holds them beyond the lifetime of the large allocation, then this is probably a bad tradeoff to make.

The problem with alloc_pages_exact() is that it exhausts all the high-order pages and leaves many free low-order pages around. So you'll end up with a system with a lot of free memory, but with all high-order pages missing. As there would be a lot of free memory, the kswapd thread would not be woken up to free some high-order pages.

I think that using slab with a high order is better, because it at least doesn't leave many low-order pages behind.

> I do see Mikulas' use case as interesting, I just don't know whether it's worth changing slab/slub to support it. At first blush, other than the sheer size of the allocations, it's a good fit.

All I need is to increase the order of a specific slab cache - I think it's better to implement an interface that allows doing that than to duplicate the slab cache code.

BTW, it would be possible to open the file "/sys/kernel/slab/<cache>/order" from the dm-bufio kernel driver and write the requested value there, but that seems very dirty. It would be better to have a kernel interface for it.

Mikulas
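[Editor's note: the cost difference argued above can be illustrated with a simplified userspace model. The assumption here is that trimming an order-sized allocation down to n pages costs one free operation per surplus page when pages are handed back individually (the alloc_pages_exact() style), versus one free per power-of-two cluster (one per set bit of the surplus) with the clustered splitting Mikulas suggests; this is a sketch of the argument, not the kernel's actual implementation.]

```c
/* Free operations needed to trim a 2^order-page allocation to n pages. */

/* Hand back each surplus page individually. */
static unsigned int frees_one_page_at_a_time(unsigned int order, unsigned int n)
{
	return (1u << order) - n;
}

/* Hand back the surplus as power-of-two clusters: one free operation
 * per set bit in the surplus page count. */
static unsigned int frees_in_pow2_clusters(unsigned int order, unsigned int n)
{
	unsigned int surplus = (1u << order) - n;
	unsigned int count = 0;

	while (surplus) {
		surplus &= surplus - 1;	/* drop one power-of-two cluster */
		count++;
	}
	return count;
}
```

For a 640K buffer with 4K pages (160 pages, rounded up to an order-8, 256-page allocation), that is 96 individual frees versus 2 cluster frees.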
On Wed, 21 Mar 2018, Christopher Lameter wrote:

> On Wed, 21 Mar 2018, Mikulas Patocka wrote:
>
> > > You should not be using the slab allocators for these. Allocate higher order pages or numbers of consecutive smaller pages from the page allocator. The slab allocators are written for objects smaller than page size.
> >
> > So, do you argue that I need to write my own slab cache functionality instead of using the existing slab code?
>
> Just use the existing page allocator calls to allocate and free the memory you need.
>
> > I can do it - but duplicating code is a bad thing.
>
> There is no need to duplicate anything. There is lots of infrastructure already in the kernel. You just need to use the right allocation / freeing calls.

So, what would you recommend for allocating 640KB objects while minimizing wasted space?

* alloc_pages - rounds up to the next power of two
* kmalloc - rounds up to the next power of two
* alloc_pages_exact - O(n*log n) complexity; and causes memory fragmentation if used excessively
* vmalloc - horrible performance (it modifies page tables, and that causes synchronization across all CPUs)

Anything else? The slab cache with a large order seems like the best choice for this.

> > > What kind of problem could be caused here?
> >
> > Unlocked accesses are generally considered bad. For example, see this piece of code in calculate_sizes:
> >
> > 	s->allocflags = 0;
> > 	if (order)
> > 		s->allocflags |= __GFP_COMP;
> >
> > 	if (s->flags & SLAB_CACHE_DMA)
> > 		s->allocflags |= GFP_DMA;
> >
> > 	if (s->flags & SLAB_RECLAIM_ACCOUNT)
> > 		s->allocflags |= __GFP_RECLAIMABLE;
> >
> > If you run this while the cache is in use (i.e. when the user writes to /sys/kernel/slab/<cache>/order), then other processes will see invalid s->allocflags for a short time.
>
> Calculating sizes is done when the slab has only a single accessor. Thus no locking is needed.

The calculation is done whenever someone writes to "/sys/kernel/slab/*/order". And you can obviously write to that file while the slab cache is in use - try it. So the function calculate_sizes can actually race with allocation from the slab cache.

> Changing the size of objects in a slab cache when there is already a set of objects allocated and under management by the slab cache would cause the allocator to fail and lead to garbled data.

I am not talking about changing the size of objects in a slab cache. I am talking about changing the allocation order of a slab cache while the cache is in use. This can be done with the sysfs interface.

Mikulas
On Wed, 21 Mar 2018, Mikulas Patocka wrote:

> > > F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not allocate powers of two pages. It would be relatively easy to make kmalloc_large round the allocation to the next page size and then allocate N consecutive pages via alloc_pages_exact() and free the remainder unused pages, or some such thing.
>
> alloc_pages_exact() has O(n*log n) complexity with respect to the number of requested pages. It would have to be reworked and optimized if it were to be used for the dm-bufio cache. (It could be optimized down to O(log n) if it didn't split the compound page into a lot of separate pages, but split it into power-of-two clusters instead.)

Well, then a memory pool of page allocator requests may address that issue?

Have a look at include/linux/mempool.h.

> > I don't know if that's a good idea. That will contribute to fragmentation if the allocation is held onto for a short-to-medium length of time. If the allocation is for a very long period of time, then those pages would have been unavailable anyway, but if the user of the tail pages holds them beyond the lifetime of the large allocation, then this is probably a bad tradeoff to make.

Fragmentation is sadly a big issue. You could create a mempool on bootup or early after boot to ensure that you have a sufficient number of contiguous pages available.

> The problem with alloc_pages_exact() is that it exhausts all the high-order pages and leaves many free low-order pages around. So you'll end up with a system with a lot of free memory, but with all high-order pages missing. As there would be a lot of free memory, the kswapd thread would not be woken up to free some high-order pages.

I think that logic is properly balanced and will take into account pages that have been removed from the LRU expiration logic.

> I think that using slab with a high order is better, because it at least doesn't leave many low-order pages behind.

Any request to the slab via kmalloc with a size > 2x page size will simply lead to a page allocator request. You have the same issue. If you want to rely on the slab allocator buffering large segments for you, then a mempool will also solve the issue for you, and you have more control over the pool.

> BTW, it would be possible to open the file "/sys/kernel/slab/<cache>/order" from the dm-bufio kernel driver and write the requested value there, but that seems very dirty. It would be better to have a kernel interface for it.

Hehehe, you could directly write to the kmem_cache structure and increase the order. AFAICT this would be dirty but would work.

But still, the increased page order will get you into trouble with fragmentation when the system runs for a long time. That is the reason we try to limit the allocation sizes coming from the slab allocator.
On Wed, 21 Mar 2018, Christopher Lameter wrote:

> On Wed, 21 Mar 2018, Mikulas Patocka wrote:
>
> > > > F.e. you could optimize the allocations > 2x PAGE_SIZE so that they do not allocate powers of two pages. It would be relatively easy to make kmalloc_large round the allocation to the next page size and then allocate N consecutive pages via alloc_pages_exact() and free the remainder unused pages, or some such thing.
> >
> > alloc_pages_exact() has O(n*log n) complexity with respect to the number of requested pages. It would have to be reworked and optimized if it were to be used for the dm-bufio cache. (It could be optimized down to O(log n) if it didn't split the compound page into a lot of separate pages, but split it into power-of-two clusters instead.)
>
> Well, then a memory pool of page allocator requests may address that issue?
>
> Have a look at include/linux/mempool.h.

I know the mempool interface. A mempool can keep some amount of reserved objects in case system memory is exhausted. A mempool doesn't deal with object allocation at all - it needs to be hooked to an existing object allocator (a slab cache, kmalloc, alloc_pages, or a custom allocator provided via the mempool_alloc_t and mempool_free_t methods).

> > > I don't know if that's a good idea. That will contribute to fragmentation if the allocation is held onto for a short-to-medium length of time. If the allocation is for a very long period of time, then those pages would have been unavailable anyway, but if the user of the tail pages holds them beyond the lifetime of the large allocation, then this is probably a bad tradeoff to make.
>
> Fragmentation is sadly a big issue. You could create a mempool on bootup or early after boot to ensure that you have a sufficient number of contiguous pages available.

The dm-bufio driver deals correctly with this - it preallocates several buffers with vmalloc when the dm-bufio cache is created. During operation, if a high-order allocation fails, the dm-bufio subsystem throws away some existing buffer and reuses the already allocated chunk of memory for the buffer that needs to be created.

So fragmentation is not an issue with this use case. dm-bufio can make forward progress even if memory is totally exhausted.

> > The problem with alloc_pages_exact() is that it exhausts all the high-order pages and leaves many free low-order pages around. So you'll end up with a system with a lot of free memory, but with all high-order pages missing. As there would be a lot of free memory, the kswapd thread would not be woken up to free some high-order pages.
>
> I think that logic is properly balanced and will take into account pages that have been removed from the LRU expiration logic.
>
> > I think that using slab with a high order is better, because it at least doesn't leave many low-order pages behind.
>
> Any request to the slab via kmalloc with a size > 2x page size will simply lead to a page allocator request. You have the same issue. If you want to rely on the slab allocator buffering large segments for you, then a mempool will also solve the issue for you, and you have more control over the pool.

A mempool solves nothing because it needs a backing allocator. And the question is what this backing allocator should be.

> > BTW, it would be possible to open the file "/sys/kernel/slab/<cache>/order" from the dm-bufio kernel driver and write the requested value there, but that seems very dirty. It would be better to have a kernel interface for it.
>
> Hehehe, you could directly write to the kmem_cache structure and increase the order. AFAICT this would be dirty but would work.
>
> But still, the increased page order will get you into trouble with fragmentation when the system runs for a long time. That is the reason we try to limit the allocation sizes coming from the slab allocator.

It won't - see above - if the high-order allocation fails, dm-bufio just reuses some existing buffer.

Mikulas
On Wed, Mar 21, 2018 at 01:40:31PM -0500, Christopher Lameter wrote: > On Wed, 21 Mar 2018, Mikulas Patocka wrote: > > > > > F.e. you could optimize the allcations > 2x PAGE_SIZE so that they do not > > > > allocate powers of two pages. It would be relatively easy to make > > > > kmalloc_large round the allocation to the next page size and then allocate > > > > N consecutive pages via alloc_pages_exact() and free the remainder unused > > > > pages or some such thing. > > > > alloc_pages_exact() has O(n*log n) complexity with respect to the number > > of requested pages. It would have to be reworked and optimized if it were > > to be used for the dm-bufio cache. (it could be optimized down to O(log n) > > if it didn't split the compound page to a lot of separate pages, but split > > it to a power-of-two clusters instead). > > Well then a memory pool of page allocator requests may address that issue? > > Have a look at include/linux/mempool.h. That's not what mempool is for. mempool is a cache of elements that were allocated from slab in the first place. (OK, technically, you don't have to use slab as the allocator, but since there is no allocator that solves this problem, mempool doesn't solve the problem either!) > > BTW. it could be possible to open the file > > "/sys/kernel/slab/<cache>/order" from the dm-bufio kernel driver and write > > the requested value there, but it seems very dirty. It would be better to > > have a kernel interface for that. > > Hehehe you could directly write to the kmem_cache structure and increase > the order. AFAICT this would be dirty but work. > > But still the increased page order will get you into trouble with > fragmentation when the system runs for a long time. That is the reason we > try to limit the allocation sizes coming from the slab allocator. Right; he has a fallback already (vmalloc). So ... let's just add the interface to allow slab caches to have their order tuned by users who really know what they're doing? 
-- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, 21 Mar 2018, Mikulas Patocka wrote: > So, what would you recommend for allocating 640KB objects while minimizing > wasted space? > * alloc_pages - rounds up to the next power of two > * kmalloc - rounds up to the next power of two > * alloc_pages_exact - O(n*log n) complexity; and causes memory > fragmentation if used excesivelly > * vmalloc - horrible performance (modifies page tables and that causes > synchronization across all CPUs) > > anything else? Need to find it but there is a way to allocate N pages in sequence somewhere. Otherwise mempools are something that would work. > > > > What kind of problem could be caused here? > > > > > > Unlocked accesses are generally considered bad. For example, see this > > > piece of code in calculate_sizes: > > > s->allocflags = 0; > > > if (order) > > > s->allocflags |= __GFP_COMP; > > > > > > if (s->flags & SLAB_CACHE_DMA) > > > s->allocflags |= GFP_DMA; > > > > > > if (s->flags & SLAB_RECLAIM_ACCOUNT) > > > s->allocflags |= __GFP_RECLAIMABLE; > > > > > > If you are running this while the cache is in use (i.e. when the user > > > writes /sys/kernel/slab/<cache>/order), then other processes will see > > > invalid s->allocflags for a short time. > > > > Calculating sizes is done when the slab has only a single accessor. Thus > > no locking is neeed. > > The calculation is done whenever someone writes to > "/sys/kernel/slab/*/order" But the flags you are mentioning do not change and the size of the object does not change. What changes is the number of objects in the slab page. > And you can obviously write to that file why the slab cache is in use. Try > it. You cannot change flags that would impact the size of the objects. I only allowed changing characteristics that does not impact object size. > I am not talking about changing the size of objects in a slab cache. I am > talking about changing the allocation order of a slab cache while the > cache is in use. This can be done with the sysfs interface. 
But then that is something that is allowed but does not affect the object size used by the slab allocators. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, 21 Mar 2018, Matthew Wilcox wrote: > > Have a look at include/linux/mempool.h. > > That's not what mempool is for. mempool is a cache of elements that were > allocated from slab in the first place. (OK, technically, you don't have > to use slab as the allocator, but since there is no allocator that solves > this problem, mempool doesn't solve the problem either!) You can put the page allocator in there instead of a slab allocator. > > But still the increased page order will get you into trouble with > > fragmentation when the system runs for a long time. That is the reason we > > try to limit the allocation sizes coming from the slab allocator. > > Right; he has a fallback already (vmalloc). So ... let's just add the > interface to allow slab caches to have their order tuned by users who > really know what they're doing? Ok thats trivial. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, 21 Mar 2018, Christopher Lameter wrote: > On Wed, 21 Mar 2018, Mikulas Patocka wrote: > > > So, what would you recommend for allocating 640KB objects while minimizing > > wasted space? > > * alloc_pages - rounds up to the next power of two > > * kmalloc - rounds up to the next power of two > > * alloc_pages_exact - O(n*log n) complexity; and causes memory > > fragmentation if used excesivelly > > * vmalloc - horrible performance (modifies page tables and that causes > > synchronization across all CPUs) > > > > anything else? > > Need to find it but there is a way to allocate N pages in sequence > somewhere. Otherwise mempools are something that would work. There's also the contiguous memory allocator (CMA), but it needs its memory to be reserved at boot time. It is intended for misdesigned hardware devices that need physically contiguous memory for DMA. As it's intended for one-time allocations when loading drivers, it lacks the performance and scalability of the slab cache and alloc_pages. > > > > > What kind of problem could be caused here? > > > > > > > > Unlocked accesses are generally considered bad. For example, see this > > > > piece of code in calculate_sizes: > > > > s->allocflags = 0; > > > > if (order) > > > > s->allocflags |= __GFP_COMP; > > > > > > > > if (s->flags & SLAB_CACHE_DMA) > > > > s->allocflags |= GFP_DMA; > > > > > > > > if (s->flags & SLAB_RECLAIM_ACCOUNT) > > > > s->allocflags |= __GFP_RECLAIMABLE; > > > > > > > > If you are running this while the cache is in use (i.e. when the user > > > > writes /sys/kernel/slab/<cache>/order), then other processes will see > > > > invalid s->allocflags for a short time. > > > > > > Calculating sizes is done when the slab has only a single accessor. Thus > > > no locking is neeed. > > > > The calculation is done whenever someone writes to > > "/sys/kernel/slab/*/order" > > But the flags you are mentioning do not change and the size of the object > does not change. 
What changes is the number of objects in the slab page. See this code again: > > > s->allocflags = 0; > > > if (order) > > > s->allocflags |= __GFP_COMP; > > > > > > if (s->flags & SLAB_CACHE_DMA) > > > s->allocflags |= GFP_DMA; > > > > > > if (s->flags & SLAB_RECLAIM_ACCOUNT) > > > s->allocflags |= __GFP_RECLAIMABLE; when this function is called, the value s->allocflags does change. At the end, s->allocflags holds the same value as before, but it changes temporarily. For example, if someone creates a slab cache with the flag SLAB_CACHE_DMA, and he allocates an object from this cache and this allocation races with the user writing to /sys/kernel/slab/cache/order - then the allocator can for a small period of time see "s->allocflags == 0" and allocate a non-DMA page. That is a bug. Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Wed, 21 Mar 2018, Christopher Lameter wrote: > One other thought: If you want to improve the behavior for large scale > objects allocated through kmalloc/kmemcache then we would certainly be > glad to entertain those ideas. > > F.e. you could optimize the allcations > 2x PAGE_SIZE so that they do not > allocate powers of two pages. It would be relatively easy to make > kmalloc_large round the allocation to the next page size and then allocate > N consecutive pages via alloc_pages_exact() and free the remainder unused > pages or some such thing. It may be possible, but we'd need to improve the horrible complexity of alloc_pages_exact(). This is a trade-off between performance and waste. A power-of-two allocation can be done quickly, but it wastes a lot of space. alloc_pages_exact() wastes less space, but it is slow. The question is - how many of these large-kmalloc allocations are short-lived and how many are long-lived? I don't know, I haven't measured it. Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On 03/21/2018 07:36 PM, Mikulas Patocka wrote: > > > On Wed, 21 Mar 2018, Christopher Lameter wrote: > >> On Wed, 21 Mar 2018, Mikulas Patocka wrote: >> >>>> You should not be using the slab allocators for these. Allocate higher >>>> order pages or numbers of consecutive smaller pagess from the page >>>> allocator. The slab allocators are written for objects smaller than page >>>> size. >>> >>> So, do you argue that I need to write my own slab cache functionality >>> instead of using the existing slab code? >> >> Just use the existing page allocator calls to allocate and free the >> memory you need. >> >>> I can do it - but duplicating code is bad thing. >> >> There is no need to duplicate anything. There is lots of infrastructure >> already in the kernel. You just need to use the right allocation / freeing >> calls. > > So, what would you recommend for allocating 640KB objects while minimizing > wasted space? > * alloc_pages - rounds up to the next power of two > * kmalloc - rounds up to the next power of two > * alloc_pages_exact - O(n*log n) complexity; and causes memory > fragmentation if used excesivelly > * vmalloc - horrible performance (modifies page tables and that causes > synchronization across all CPUs) > > anything else? > > The slab cache with large order seems as a best choice for this. Sorry for being late, I just read this thread and tend to agree with Mikulas, that this is a good use case for SL*B. If we extend the use-case from "space-efficient allocator of objects smaller than page size" to "space-efficient allocator of objects that are not power-of-two pages" then IMHO it turns out the implementation would be almost the same. All other variants listed above would lead to waste of memory or fragmentation. Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you attending, or anyone else that can vouch for your usecase? -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Fri, Apr 13 2018 at 5:22am -0400, Vlastimil Babka <vbabka@suse.cz> wrote: > On 03/21/2018 07:36 PM, Mikulas Patocka wrote: > > > > > > On Wed, 21 Mar 2018, Christopher Lameter wrote: > > > >> On Wed, 21 Mar 2018, Mikulas Patocka wrote: > >> > >>>> You should not be using the slab allocators for these. Allocate higher > >>>> order pages or numbers of consecutive smaller pagess from the page > >>>> allocator. The slab allocators are written for objects smaller than page > >>>> size. > >>> > >>> So, do you argue that I need to write my own slab cache functionality > >>> instead of using the existing slab code? > >> > >> Just use the existing page allocator calls to allocate and free the > >> memory you need. > >> > >>> I can do it - but duplicating code is bad thing. > >> > >> There is no need to duplicate anything. There is lots of infrastructure > >> already in the kernel. You just need to use the right allocation / freeing > >> calls. > > > > So, what would you recommend for allocating 640KB objects while minimizing > > wasted space? > > * alloc_pages - rounds up to the next power of two > > * kmalloc - rounds up to the next power of two > > * alloc_pages_exact - O(n*log n) complexity; and causes memory > > fragmentation if used excesivelly > > * vmalloc - horrible performance (modifies page tables and that causes > > synchronization across all CPUs) > > > > anything else? > > > > The slab cache with large order seems as a best choice for this. > > Sorry for being late, I just read this thread and tend to agree with > Mikulas, that this is a good use case for SL*B. If we extend the > use-case from "space-efficient allocator of objects smaller than page > size" to "space-efficient allocator of objects that are not power-of-two > pages" then IMHO it turns out the implementation would be almost the > same. All other variants listed above would lead to waste of memory or > fragmentation. > > Would this perhaps be a good LSF/MM discussion topic? 
Mikulas, are you > attending, or anyone else that can vouch for your usecase? Any further discussion on SLAB_MINIMIZE_WASTE should continue on list. Mikulas won't be at LSF/MM. But I included Mikulas' dm-bufio changes that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of the 4.17 merge window). Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On 04/13/2018 05:10 PM, Mike Snitzer wrote: > On Fri, Apr 13 2018 at 5:22am -0400, > Vlastimil Babka <vbabka@suse.cz> wrote: >> >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you >> attending, or anyone else that can vouch for your usecase? > > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list. > > Mikulas won't be at LSF/MM. But I included Mikulas' dm-bufio changes > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of > the 4.17 merge window). Can you or Mikulas briefly summarize how the dependency is avoided, and whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the dm-bufio code would happily switch to it, or not? Thanks, Vlastimil > Mike > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, Apr 16 2018 at 8:38am -0400, Vlastimil Babka <vbabka@suse.cz> wrote: > On 04/13/2018 05:10 PM, Mike Snitzer wrote: > > On Fri, Apr 13 2018 at 5:22am -0400, > > Vlastimil Babka <vbabka@suse.cz> wrote: > >> > >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you > >> attending, or anyone else that can vouch for your usecase? > > > > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list. > > > > Mikulas won't be at LSF/MM. But I included Mikulas' dm-bufio changes > > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of > > the 4.17 merge window). > > Can you or Mikulas briefly summarize how the dependency is avoided, and > whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the > dm-bufio code would happily switch to it, or not? git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c But the most significant commit relative to SLAB_MINIMIZE_WASTE is: 359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for dm_buffer structure allocations") So no, I don't see why dm-bufio would need to switch to SLAB_MINIMIZE_WASTE if it were introduced in the future. Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 16 Apr 2018, Mike Snitzer wrote: > On Mon, Apr 16 2018 at 8:38am -0400, > Vlastimil Babka <vbabka@suse.cz> wrote: > > > On 04/13/2018 05:10 PM, Mike Snitzer wrote: > > > On Fri, Apr 13 2018 at 5:22am -0400, > > > Vlastimil Babka <vbabka@suse.cz> wrote: > > >> > > >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you > > >> attending, or anyone else that can vouch for your usecase? > > > > > > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list. > > > > > > Mikulas won't be at LSF/MM. But I included Mikulas' dm-bufio changes > > > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of > > > the 4.17 merge window). > > > > Can you or Mikulas briefly summarize how the dependency is avoided, and > > whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the > > dm-bufio code would happily switch to it, or not? > > git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c > > But the most signficant commit relative to SLAB_MINIMIZE_WASTE is: > 359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for > dm_buffer structure allocations") > > So no, I don't see why dm-bufio would need to switch to > SLAB_MINIMIZE_WASTE if it were introduced in the future. Currently, the slab cache rounds up the size of the slab to the next power of two (if the size is large). And that wastes memory if that memory were to be used for deduplication tables. Generally, the performance of the deduplication solution depends on how much data can you put to memory. If you round 640KB buffer to 1MB (this is what the slab and slub subsystem currently do), you waste a lot of memory. Deduplication indices with 640KB blocks are already used in the wild, so it can't be easily changed. > Mike Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, Apr 16 2018 at 10:37am -0400, Mikulas Patocka <mpatocka@redhat.com> wrote: > > > On Mon, 16 Apr 2018, Mike Snitzer wrote: > > > On Mon, Apr 16 2018 at 8:38am -0400, > > Vlastimil Babka <vbabka@suse.cz> wrote: > > > > > On 04/13/2018 05:10 PM, Mike Snitzer wrote: > > > > On Fri, Apr 13 2018 at 5:22am -0400, > > > > Vlastimil Babka <vbabka@suse.cz> wrote: > > > >> > > > >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you > > > >> attending, or anyone else that can vouch for your usecase? > > > > > > > > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list. > > > > > > > > Mikulas won't be at LSF/MM. But I included Mikulas' dm-bufio changes > > > > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of > > > > the 4.17 merge window). > > > > > > Can you or Mikulas briefly summarize how the dependency is avoided, and > > > whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the > > > dm-bufio code would happily switch to it, or not? > > > > git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c > > > > But the most signficant commit relative to SLAB_MINIMIZE_WASTE is: > > 359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for > > dm_buffer structure allocations") > > > > So no, I don't see why dm-bufio would need to switch to > > SLAB_MINIMIZE_WASTE if it were introduced in the future. > > Currently, the slab cache rounds up the size of the slab to the next power > of two (if the size is large). And that wastes memory if that memory were > to be used for deduplication tables. You mean on an overall size of the cache level? Or on a per-object level? I can only imagine you mean the former. > Generally, the performance of the deduplication solution depends on how > much data can you put to memory. If you round 640KB buffer to 1MB (this is > what the slab and slub subsystem currently do), you waste a lot of memory. 
> Deduplication indices with 640KB blocks are already used in the wild, so > it can't be easily changed. OK, seems you're suggesting a single object is rounded up.. so then this header is very wrong?: commit 359dbf19ab524652a2208a2a2cddccec2eede2ad Author: Mikulas Patocka <mpatocka@redhat.com> Date: Mon Mar 26 20:29:45 2018 +0200 dm bufio: use slab cache for dm_buffer structure allocations kmalloc padded to the next power of two, using a slab cache avoids this. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Please clarify further, thanks! Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 16 Apr 2018, Mike Snitzer wrote: > On Mon, Apr 16 2018 at 10:37am -0400, > Mikulas Patocka <mpatocka@redhat.com> wrote: > > > > > > > On Mon, 16 Apr 2018, Mike Snitzer wrote: > > > > > On Mon, Apr 16 2018 at 8:38am -0400, > > > Vlastimil Babka <vbabka@suse.cz> wrote: > > > > > > > On 04/13/2018 05:10 PM, Mike Snitzer wrote: > > > > > On Fri, Apr 13 2018 at 5:22am -0400, > > > > > Vlastimil Babka <vbabka@suse.cz> wrote: > > > > >> > > > > >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you > > > > >> attending, or anyone else that can vouch for your usecase? > > > > > > > > > > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list. > > > > > > > > > > Mikulas won't be at LSF/MM. But I included Mikulas' dm-bufio changes > > > > > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of > > > > > the 4.17 merge window). > > > > > > > > Can you or Mikulas briefly summarize how the dependency is avoided, and > > > > whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the > > > > dm-bufio code would happily switch to it, or not? > > > > > > git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c > > > > > > But the most signficant commit relative to SLAB_MINIMIZE_WASTE is: > > > 359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for > > > dm_buffer structure allocations") > > > > > > So no, I don't see why dm-bufio would need to switch to > > > SLAB_MINIMIZE_WASTE if it were introduced in the future. > > > > Currently, the slab cache rounds up the size of the slab to the next power > > of two (if the size is large). And that wastes memory if that memory were > > to be used for deduplication tables. > > You mean on an overall size of the cache level? Or on a per-object > level? I can only imagine you mean the former. Unfortunately, it rounds up every object. So, if you have six 640KB objects, it consumes 6MB. 
> > Generally, the performance of the deduplication solution depends on how > > much data can you put to memory. If you round 640KB buffer to 1MB (this is > > what the slab and slub subsystem currently do), you waste a lot of memory. > > Deduplication indices with 640KB blocks are already used in the wild, so > > it can't be easily changed. > > OK, seems you're suggesting a single object is rounded up.. so then this > header is very wrong?: > > commit 359dbf19ab524652a2208a2a2cddccec2eede2ad > Author: Mikulas Patocka <mpatocka@redhat.com> > Date: Mon Mar 26 20:29:45 2018 +0200 > > dm bufio: use slab cache for dm_buffer structure allocations > > kmalloc padded to the next power of two, using a slab cache avoids this. > > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> > Signed-off-by: Mike Snitzer <snitzer@redhat.com> > > Please clarify further, thanks! > Mike Yes, using a slab cache currently doesn't avoid this rounding (it needs the SLAB_MINIMIZE_WASTE patch to do that). Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 16 Apr 2018, Mikulas Patocka wrote: > > Please clarify further, thanks! > > Mike > > Yes, using a slab cache currently doesn't avoid this rouding (it needs the > SLAB_MINIMIZE_WASTE patch to do that). Or an increase in slab_max_order -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 16 Apr 2018, Christopher Lameter wrote: > On Mon, 16 Apr 2018, Mikulas Patocka wrote: > > > > Please clarify further, thanks! > > > Mike > > > > Yes, using a slab cache currently doesn't avoid this rouding (it needs the > > SLAB_MINIMIZE_WASTE patch to do that). > > Or an increase in slab_max_order But that will increase it for all slabs (often senselessly - i.e. kmalloc-4096 would have order 4MB). I need to increase it just for dm-bufio slabs. Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 16 Apr 2018, Mikulas Patocka wrote: > > > > Or an increase in slab_max_order > > But that will increase it for all slabs (often senselessly - i.e. > kmalloc-4096 would have order 4MB). 4MB? Nope.... That is a power of two slab so no wasted space even with order 0. Its not a senseless increase. The more objects you fit into a slab page the higher the performance of the allocator. > I need to increase it just for dm-bufio slabs. If you do this then others will want the same... -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 16 Apr 2018, Christopher Lameter wrote: > On Mon, 16 Apr 2018, Mikulas Patocka wrote: > > > > > > > Or an increase in slab_max_order > > > > But that will increase it for all slabs (often senselessly - i.e. > > kmalloc-4096 would have order 4MB). > > 4MB? Nope.... That is a power of two slab so no wasted space even with > order 0. See this email: https://www.redhat.com/archives/dm-devel/2018-March/msg00387.html If you boot with slub_max_order=10, the kmalloc-8192 cache has 64 pages. So yes, it increases the order of all slab caches (although not up to 4MB). > Its not a senseless increase. The more objects you fit into a slab page > the higher the performance of the allocator. > > > > I need to increase it just for dm-bufio slabs. > > If you do this then others will want the same... If others need it, they can turn on the flag SLAB_MINIMIZE_WASTE too. Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On 04/16/2018 04:37 PM, Mikulas Patocka wrote: >>> Can you or Mikulas briefly summarize how the dependency is avoided, and >>> whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the >>> dm-bufio code would happily switch to it, or not? >> >> git log eeb67a0ba04df^..45354f1eb67224669a1 -- drivers/md/dm-bufio.c >> >> But the most signficant commit relative to SLAB_MINIMIZE_WASTE is: >> 359dbf19ab524652a2208a2a2cddccec2eede2ad ("dm bufio: use slab cache for >> dm_buffer structure allocations") >> >> So no, I don't see why dm-bufio would need to switch to >> SLAB_MINIMIZE_WASTE if it were introduced in the future. > > Currently, the slab cache rounds up the size of the slab to the next power > of two (if the size is large). And that wastes memory if that memory were > to be used for deduplication tables. > > Generally, the performance of the deduplication solution depends on how > much data can you put to memory. If you round 640KB buffer to 1MB (this is > what the slab and slub subsystem currently do), you waste a lot of memory. > Deduplication indices with 640KB blocks are already used in the wild, so > it can't be easily changed. Thank you both for the clarification. Vlastimil > >> Mike > > Mikulas > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On 04/16/2018 09:36 PM, Mikulas Patocka wrote: > > > On Mon, 16 Apr 2018, Christopher Lameter wrote: > >> On Mon, 16 Apr 2018, Mikulas Patocka wrote: >> >>>> >>>> Or an increase in slab_max_order >>> >>> But that will increase it for all slabs (often senselessly - i.e. >>> kmalloc-4096 would have order 4MB). >> >> 4MB? Nope.... That is a power of two slab so no wasted space even with >> order 0. > > See this email: > https://www.redhat.com/archives/dm-devel/2018-March/msg00387.html > > If you boot with slub_max_order=10, the kmalloc-8192 cache has 64 pages. > So yes, it increases the order of all slab caches (although not up to > 4MB). > >> Its not a senseless increase. The more objects you fit into a slab page >> the higher the performance of the allocator. It's not universally without a cost. It might increase internal fragmentation of the slabs, if you end up with lots of 4MB pages containing just a few objects. Thus, waste of memory. You also consume high-order pages that could be used elsewhere. If you fail to allocate 4MB, then what's the fallback, order-0? I doubt it's "the highest available order". Thus, a more conservative choice, e.g. order-3, might succeed more often in allocating order-3, while a choice of 4MB will have many order-0 fallbacks. >>> I need to increase it just for dm-bufio slabs. >> >> If you do this then others will want the same... > > If others need it, they can turn on the flag SLAB_MINIMIZE_WASTE too. I think it should be possible without a new flag. The slub allocator could just balance priorities (performance vs memory efficiency) better. Currently I get the impression that "slub_max_order" is a performance tunable. Let's add another criterion for selecting an order, that would try to pick an order to minimize wasted space below e.g. 10% with some different kind of max order. Pick good defaults, add tunables if you must. 
I mean, anyone who's creating a cache for 640KB objects most likely doesn't want to waste another 384KB by each such object. They shouldn't have to add a flag to let the slub allocator figure out that using 2MB pages is the right thing to do here. Vlastimil > Mikulas > -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 16 Apr 2018, Vlastimil Babka wrote: > On 04/16/2018 09:36 PM, Mikulas Patocka wrote: > > >>> I need to increase it just for dm-bufio slabs. > >> > >> If you do this then others will want the same... > > > > If others need it, they can turn on the flag SLAB_MINIMIZE_WASTE too. > > I think it should be possible without a new flag. The slub allocator > could just balance priorities (performance vs memory efficiency) better. > Currently I get the impression that "slub_max_order" is a performance > tunable. Let's add another criteria for selecting an order, that would > try to pick an order to minimize wasted space below e.g. 10% with some > different kind of max order. Pick good defaults, add tunables if you must. > > I mean, anyone who's creating a cache for 640KB objects most likely > doesn't want to waste another 384KB by each such object. They shouldn't > have to add a flag to let the slub allocator figure out that using 2MB > pages is the right thing to do here. > > Vlastimil The problem is that higher-order allocations (larger than 32K) are unreliable. So, if you increase page order beyond that, the allocation may randomly fail. dm-bufio deals gracefully with allocation failure, because it preallocates some buffers with vmalloc, but other subsystems may not deal with it and they could return ENOMEM randomly or misbehave in other ways. So, the "SLAB_MINIMIZE_WASTE" flag is also saying that the allocation may fail and the caller is prepared to deal with it. The slub subsystem actually falls back to a lower order when the allocation fails (it allows a different order for each slab in the same cache), but slab doesn't fall back and you get NULL if a higher-order allocation fails. So, SLAB_MINIMIZE_WASTE is needed for slab because it will just randomly fail with higher order. Mikulas -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
On Mon, 16 Apr 2018, Vlastimil Babka wrote:

> On 04/13/2018 05:10 PM, Mike Snitzer wrote:
> > On Fri, Apr 13 2018 at 5:22am -0400,
> > Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> Would this perhaps be a good LSF/MM discussion topic? Mikulas, are you
> >> attending, or anyone else that can vouch for your usecase?
> >
> > Any further discussion on SLAB_MINIMIZE_WASTE should continue on list.
> >
> > Mikulas won't be at LSF/MM. But I included Mikulas' dm-bufio changes
> > that no longer depend on this proposed SLAB_MINIMIZE_WASTE (as part of
> > the 4.17 merge window).
>
> Can you or Mikulas briefly summarize how the dependency is avoided, and

This was Mike's misconception - dm-bufio actually needs the
SLAB_MINIMIZE_WASTE patch, otherwise it is wasting memory.

> whether if (something like) SLAB_MINIMIZE_WASTE were implemented, the
> dm-bufio code would happily switch to it, or not?
>
> Thanks,
> Vlastimil

Mikulas
On Mon, 16 Apr 2018, Mikulas Patocka wrote:

> dm-bufio deals gracefully with allocation failure, because it preallocates
> some buffers with vmalloc, but other subsystems may not deal with it and
> they could return ENOMEM randomly or misbehave in other ways. So, the
> "SLAB_MINIMIZE_WASTE" flag is also saying that the allocation may fail and
> the caller is prepared to deal with it.
>
> The slub subsystem does actual fallback to low-order when the allocation
> fails (it allows different order for each slab in the same cache), but
> slab doesn't fallback and you get NULL if higher-order allocation fails.
> So, SLAB_MINIMIZE_WASTE is needed for slab because it will just randomly
> fail with higher order.

Fix Slab instead of adding a flag that is only useful for one allocator?
On Mon, 16 Apr 2018, Mikulas Patocka wrote:

> If you boot with slub_max_order=10, the kmalloc-8192 cache has 64 pages.
> So yes, it increases the order of all slab caches (although not up to
> 4MB).

Hmmm... Ok. There is another setting slub_min_objects that controls how
many objects to fit into a slab page. We could change the allocation
scheme so that it finds the minimum order with the minimum waste.
Allocator performance will drop though since fewer objects are then in a
slab page.
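A toy model of what slub_min_objects asks for is the smallest order whose slab page fits at least the requested number of objects. Real slub also weighs slub_max_order and acceptable waste fractions, so this is only a sketch under assumed constants:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define MAX_ORDER 11

/* Smallest order whose slab page holds at least `min_objects`
 * objects of the given size; -1 if no order within MAX_ORDER does. */
static int min_order_for_objects(unsigned long size, unsigned long min_objects)
{
	int order;

	for (order = 0; order < MAX_ORDER; order++)
		if ((PAGE_SIZE << order) / size >= min_objects)
			return order;
	return -1;
}
```

For 8KB objects, asking for at least 4 objects forces order 3 (32KB slabs), and asking for 16 forces order 5 (128KB slabs), which is why raising the min-objects target pushes slab orders up across the board.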
On Mon, 16 Apr 2018, Vlastimil Babka wrote:

> >> Its not a senseless increase. The more objects you fit into a slab page
> >> the higher the performance of the allocator.
>
> It's not universally without a cost. It might increase internal
> fragmentation of the slabs, if you end up with lots of 4MB pages
> containing just few objects. Thus, waste of memory. You also consume
> high-order pages that could be used elsewhere. If you fail to allocate
> 4MB, then what's the fallback, order-0? I doubt it's "the highest
> available order". Thus, a more conservative choice e.g. order-3 might
> succeed more in allocating order-3, while a choice of 4MB will have
> many order-0 fallbacks.

Obviously there is a cost. But here the subsystem has a fallback.

> I think it should be possible without a new flag. The slub allocator
> could just balance priorities (performance vs memory efficiency) better.
> Currently I get the impression that "slub_max_order" is a performance
> tunable. Let's add another criteria for selecting an order, that would
> try to pick an order to minimize wasted space below e.g. 10% with some
> different kind of max order. Pick good defaults, add tunables if you must.

There is also slub_min_objects.

> I mean, anyone who's creating a cache for 640KB objects most likely
> doesn't want to waste another 384KB by each such object. They shouldn't
> have to add a flag to let the slub allocator figure out that using 2MB
> pages is the right thing to do here.

I agree if we do this then preferably without a flag.
On Tue, 17 Apr 2018, Christopher Lameter wrote:

> On Mon, 16 Apr 2018, Mikulas Patocka wrote:
>
> > dm-bufio deals gracefully with allocation failure, because it preallocates
> > some buffers with vmalloc, but other subsystems may not deal with it and
> > they could return ENOMEM randomly or misbehave in other ways. So, the
> > "SLAB_MINIMIZE_WASTE" flag is also saying that the allocation may fail and
> > the caller is prepared to deal with it.
> >
> > The slub subsystem does actual fallback to low-order when the allocation
> > fails (it allows different order for each slab in the same cache), but
> > slab doesn't fallback and you get NULL if higher-order allocation fails.
> > So, SLAB_MINIMIZE_WASTE is needed for slab because it will just randomly
> > fail with higher order.
>
> Fix Slab instead of adding a flag that is only useful for one allocator?

Slab assumes that all slabs have the same order, so it's not so easy to
fix it.

Mikulas
On Tue, 17 Apr 2018, Mikulas Patocka wrote:

> > > The slub subsystem does actual fallback to low-order when the allocation
> > > fails (it allows different order for each slab in the same cache), but
> > > slab doesn't fallback and you get NULL if higher-order allocation fails.
> > > So, SLAB_MINIMIZE_WASTE is needed for slab because it will just randomly
> > > fail with higher order.
> >
> > Fix Slab instead of adding a flag that is only useful for one allocator?
>
> Slab assumes that all slabs have the same order, so it's not so easy to
> fix it.

Well since SLAB uses compound pages one could easily determine the order
of the page and thus also support multiple page orders there.
Index: linux-2.6/include/linux/slab.h
===================================================================
--- linux-2.6.orig/include/linux/slab.h	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/include/linux/slab.h	2018-03-20 14:59:20.518030000 +0100
@@ -108,6 +108,13 @@
 #define SLAB_KASAN		0
 #endif
 
+/*
+ * Use higher order allocations to minimize wasted space.
+ * Note: the allocation is unreliable if this flag is used, the caller
+ * must handle allocation failures gracefully.
+ */
+#define SLAB_MINIMIZE_WASTE	((slab_flags_t __force)0x10000000U)
+
 /* The following flags affect the page allocator grouping pages by mobility */
 /* Objects are reclaimable */
 #define SLAB_RECLAIM_ACCOUNT	((slab_flags_t __force)0x00020000U)
Index: linux-2.6/mm/slab_common.c
===================================================================
--- linux-2.6.orig/mm/slab_common.c	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/mm/slab_common.c	2018-03-20 14:59:20.518030000 +0100
@@ -52,7 +52,7 @@ static DECLARE_WORK(slab_caches_to_rcu_d
 		SLAB_FAILSLAB | SLAB_KASAN)
 
 #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
-			 SLAB_ACCOUNT)
+			 SLAB_ACCOUNT | SLAB_MINIMIZE_WASTE)
 
 /*
  * Merge control. If this is set then no merging of slab caches will occur.
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/mm/slub.c	2018-03-20 14:59:20.518030000 +0100
@@ -3234,7 +3234,7 @@ static inline int slab_order(int size, i
 	return order;
 }
 
-static inline int calculate_order(int size, int reserved)
+static inline int calculate_order(int size, int reserved, slab_flags_t flags)
 {
 	int order;
 	int min_objects;
@@ -3261,7 +3261,7 @@ static inline int calculate_order(int si
 			order = slab_order(size, min_objects,
 					slub_max_order, fraction, reserved);
 			if (order <= slub_max_order)
-				return order;
+				goto ret_order;
 			fraction /= 2;
 		}
 		min_objects--;
@@ -3273,15 +3273,30 @@ static inline int calculate_order(int si
 	 */
 	order = slab_order(size, 1, slub_max_order, 1, reserved);
 	if (order <= slub_max_order)
-		return order;
+		goto ret_order;
 
 	/*
 	 * Doh this slab cannot be placed using slub_max_order.
 	 */
 	order = slab_order(size, 1, MAX_ORDER, 1, reserved);
 	if (order < MAX_ORDER)
-		return order;
+		goto ret_order;
 	return -ENOSYS;
+
+ret_order:
+	if (flags & SLAB_MINIMIZE_WASTE) {
+		/* Increase the order if it decreases waste */
+		int test_order;
+		for (test_order = order + 1; test_order < MAX_ORDER; test_order++) {
+			unsigned long order_objects = ((PAGE_SIZE << order) - reserved) / size;
+			unsigned long test_order_objects = ((PAGE_SIZE << test_order) - reserved) / size;
+			if (test_order_objects >= min(64, MAX_OBJS_PER_PAGE))
+				break;
+			if (test_order_objects > order_objects << (test_order - order))
+				order = test_order;
+		}
+	}
+	return order;
 }
 
 static void
@@ -3546,7 +3561,7 @@ static int calculate_sizes(struct kmem_c
 	if (forced_order >= 0)
 		order = forced_order;
 	else
-		order = calculate_order(size, s->reserved);
+		order = calculate_order(size, s->reserved, flags);
 
 	if (order < 0)
 		return 0;
Index: linux-2.6/mm/slab.h
===================================================================
--- linux-2.6.orig/mm/slab.h	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/mm/slab.h	2018-03-20 14:59:20.518030000 +0100
@@ -142,10 +142,10 @@ static inline slab_flags_t kmem_cache_fl
 #if defined(CONFIG_SLAB)
 #define SLAB_CACHE_FLAGS (SLAB_MEM_SPREAD | SLAB_NOLEAKTRACE | \
 			  SLAB_RECLAIM_ACCOUNT | SLAB_TEMPORARY | \
-			  SLAB_ACCOUNT)
+			  SLAB_ACCOUNT | SLAB_MINIMIZE_WASTE)
 #elif defined(CONFIG_SLUB)
 #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
-			  SLAB_TEMPORARY | SLAB_ACCOUNT)
+			  SLAB_TEMPORARY | SLAB_ACCOUNT | SLAB_MINIMIZE_WASTE)
 #else
 #define SLAB_CACHE_FLAGS (0)
 #endif
@@ -164,7 +164,8 @@ static inline slab_flags_t kmem_cache_fl
 			      SLAB_NOLEAKTRACE | \
 			      SLAB_RECLAIM_ACCOUNT | \
 			      SLAB_TEMPORARY | \
-			      SLAB_ACCOUNT)
+			      SLAB_ACCOUNT | \
+			      SLAB_MINIMIZE_WASTE)
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c	2018-03-20 14:59:20.528030000 +0100
+++ linux-2.6/mm/slab.c	2018-03-20 14:59:20.518030000 +0100
@@ -1789,14 +1789,14 @@ static size_t calculate_slab_order(struc
 		 * as GFP_NOFS and we really don't want to have to be allocating
 		 * higher-order pages when we are unable to shrink dcache.
 		 */
-		if (flags & SLAB_RECLAIM_ACCOUNT)
+		if (flags & SLAB_RECLAIM_ACCOUNT && !(flags & SLAB_MINIMIZE_WASTE))
 			break;
 
 		/*
 		 * Large number of objects is good, but very large slabs are
 		 * currently bad for the gfp()s.
 		 */
-		if (gfporder >= slab_max_order)
+		if (gfporder >= slab_max_order && !(flags & SLAB_MINIMIZE_WASTE))
 			break;
 
 		/*
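The `ret_order` loop the patch adds to calculate_order() can be exercised in userspace. The sketch below mirrors the patch's logic with stand-in values for PAGE_SIZE, MAX_ORDER, and MAX_OBJS_PER_PAGE: the order is raised only while the larger slab packs strictly more objects than the corresponding number of smaller slabs would (i.e. while waste actually drops), and the search stops once a slab would hold 64 objects:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define MAX_ORDER 11
#define MAX_OBJS_PER_PAGE 32767
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Userspace replica of the patch's SLAB_MINIMIZE_WASTE order bump. */
static int minimize_waste(int order, unsigned long size, unsigned long reserved)
{
	int test_order;

	for (test_order = order + 1; test_order < MAX_ORDER; test_order++) {
		unsigned long order_objects =
			((PAGE_SIZE << order) - reserved) / size;
		unsigned long test_order_objects =
			((PAGE_SIZE << test_order) - reserved) / size;

		/* enough objects per slab already; stop growing */
		if (test_order_objects >= MIN(64, MAX_OBJS_PER_PAGE))
			break;
		/* take the bigger order only if it packs objects denser */
		if (test_order_objects > order_objects << (test_order - order))
			order = test_order;
	}
	return order;
}
```

For the 640KB case from the thread, starting at order 8 the loop settles on order 9 (three objects per 2MB slab beats one per 1MB), and for a 192KB object starting at order 6 it climbs to order 10, which shows how the flag trades allocation reliability for packing density.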