Message ID | 20200818080415.7531-1-hyesoo.yu@samsung.com (mailing list archive) |
---|---|
Series | Chunk Heap Support on DMA-HEAP |
Hi,

On Tue, Aug 18, 2020 at 05:04:12PM +0900, Hyesoo Yu wrote:
> These patch series to introduce a new dma heap, chunk heap.
> That heap is needed for special HW that requires bulk allocation of
> fixed high order pages. For example, 64MB dma-buf pages are made up
> to fixed order-4 pages * 1024.
>
> The chunk heap uses alloc_pages_bulk to allocate high order page.
> https://lore.kernel.org/linux-mm/20200814173131.2803002-1-minchan@kernel.org
>
> The chunk heap is registered by device tree with alignment and memory node
> of contiguous memory allocator(CMA). Alignment defines chunk page size.
> For example, alignment 0x1_0000 means chunk page size is 64KB.
> The phandle to memory node indicates contiguous memory allocator(CMA).
> If device node doesn't have cma, the registration of chunk heap fails.

This reminds me of an ion heap developed at Arm several years ago:
https://git.linaro.org/landing-teams/working/arm/kernel.git/tree/drivers/staging/android/ion/ion_compound_page.c

Some more descriptive text here:
https://github.com/ARM-software/CPA

It maintains a pool of high-order pages with a worker thread to
attempt compaction and allocation to keep the pool filled, with high
and low watermarks to trigger freeing/allocating of chunks.
It implements a shrinker to allow the system to reclaim the pool under
high memory pressure.

Is maintaining a pool something you considered? From the
alloc_pages_bulk thread it sounds like you want to allocate 300M at a
time, so I expect if you tuned the pool size to match that it could
work quite well.

That implementation isn't using a CMA region, but a similar approach
could definitely be applied.

Thanks,
-Brian

>
> The patchset includes the following:
> - export dma-heap API to register kernel module dma heap.
> - add chunk heap implementation.
> - document of device tree to register chunk heap
>
> Hyesoo Yu (3):
>   dma-buf: add missing EXPORT_SYMBOL_GPL() for dma heaps
>   dma-buf: heaps: add chunk heap to dmabuf heaps
>   dma-heap: Devicetree binding for chunk heap
>
>  .../devicetree/bindings/dma-buf/chunk_heap.yaml |  46 +++++
>  drivers/dma-buf/dma-heap.c                      |   2 +
>  drivers/dma-buf/heaps/Kconfig                   |   9 +
>  drivers/dma-buf/heaps/Makefile                  |   1 +
>  drivers/dma-buf/heaps/chunk_heap.c              | 222 +++++++++++++++++++++
>  drivers/dma-buf/heaps/heap-helpers.c            |   2 +
>  6 files changed, 282 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/dma-buf/chunk_heap.yaml
>  create mode 100644 drivers/dma-buf/heaps/chunk_heap.c
>
> --
> 2.7.4
>
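For readers following the numbers in the quoted cover letter, here is a minimal sketch of the arithmetic, not code from the patch series. It assumes 4 KiB base pages (the cover letter does not state the page size), and the helper names chunk_order_from_alignment() and chunk_count() are hypothetical.

```c
#include <assert.h>

/* Assumption: 4 KiB base pages; the cover letter does not say. */
#define BASE_PAGE_SIZE 4096UL

/*
 * Hypothetical helper: convert a chunk alignment in bytes (the value the
 * cover letter says is given in the device tree) into a page order.
 * Returns -1 if the alignment is smaller than a base page or is not a
 * power of two.
 */
static int chunk_order_from_alignment(unsigned long alignment)
{
	unsigned long pages;
	int order = 0;

	if (alignment < BASE_PAGE_SIZE || (alignment & (alignment - 1)) != 0)
		return -1;

	/* Count how many times the page count can be halved down to 1. */
	pages = alignment / BASE_PAGE_SIZE;
	while (pages > 1) {
		pages >>= 1;
		order++;
	}
	return order;
}

/* Hypothetical helper: chunks needed to back `len` bytes, rounded up. */
static unsigned long chunk_count(unsigned long len, int order)
{
	unsigned long chunk_size;

	assert(order >= 0);
	chunk_size = BASE_PAGE_SIZE << order;
	return (len + chunk_size - 1) / chunk_size;
}
```

Under that assumption, chunk_order_from_alignment(0x10000) gives order 4 (64KB chunks), and chunk_count(64UL << 20, 4) gives 1024, which matches the "order-4 pages * 1024" example for a 64MB buffer.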
On Tue, Aug 18, 2020 at 12:45 AM Hyesoo Yu <hyesoo.yu@samsung.com> wrote:
>
> These patch series to introduce a new dma heap, chunk heap.
> That heap is needed for special HW that requires bulk allocation of
> fixed high order pages. For example, 64MB dma-buf pages are made up
> to fixed order-4 pages * 1024.
>
> The chunk heap uses alloc_pages_bulk to allocate high order page.
> https://lore.kernel.org/linux-mm/20200814173131.2803002-1-minchan@kernel.org
>
> The chunk heap is registered by device tree with alignment and memory node
> of contiguous memory allocator(CMA). Alignment defines chunk page size.
> For example, alignment 0x1_0000 means chunk page size is 64KB.
> The phandle to memory node indicates contiguous memory allocator(CMA).
> If device node doesn't have cma, the registration of chunk heap fails.
>
> The patchset includes the following:
> - export dma-heap API to register kernel module dma heap.
> - add chunk heap implementation.
> - document of device tree to register chunk heap
>
> Hyesoo Yu (3):
>   dma-buf: add missing EXPORT_SYMBOL_GPL() for dma heaps
>   dma-buf: heaps: add chunk heap to dmabuf heaps
>   dma-heap: Devicetree binding for chunk heap

Hey! Thanks so much for sending this out! I'm really excited to see
these heaps be submitted and reviewed on the list!

The first general concern I have with your series is that it adds a dt
binding for the chunk heap, which we've gotten a fair amount of
pushback on.

A possible alternative might be something like what Kunihiko Hayashi
proposed for non-default CMA heaps:
https://lore.kernel.org/lkml/1594948208-4739-1-git-send-email-hayashi.kunihiko@socionext.com/

This approach would instead allow a driver to register a CMA area with
the chunk heap implementation.

However (and this was the catch with Kunihiko Hayashi's patch), this
requires that the driver also be upstream, as we need an in-tree user
of such code.

Also, it might be good to provide some further rationale on why this
heap is beneficial over the existing CMA heap. In general, focusing
the commit messages more on why we might want the patch, rather than
on what the patch does, is helpful.

"Special hardware" that doesn't have upstream drivers isn't very
compelling for most maintainers.

That said, I'm very excited to see these sorts of submissions, as I
know lots of vendors have historically had very custom out-of-tree ION
heaps, and I think it would be a great benefit to the community to
better understand the experience vendors have in optimizing
performance on their devices, so we can create good common solutions
upstream. So I look forward to your insights on future revisions of
this patch series!

thanks
-john
On Tue, Aug 18, 2020 at 11:55:57AM +0100, Brian Starkey wrote:
> Hi,
>
> On Tue, Aug 18, 2020 at 05:04:12PM +0900, Hyesoo Yu wrote:
> > These patch series to introduce a new dma heap, chunk heap.
> > That heap is needed for special HW that requires bulk allocation of
> > fixed high order pages. For example, 64MB dma-buf pages are made up
> > to fixed order-4 pages * 1024.
> >
> > The chunk heap uses alloc_pages_bulk to allocate high order page.
> > https://lore.kernel.org/linux-mm/20200814173131.2803002-1-minchan@kernel.org
> >
> > The chunk heap is registered by device tree with alignment and memory node
> > of contiguous memory allocator(CMA). Alignment defines chunk page size.
> > For example, alignment 0x1_0000 means chunk page size is 64KB.
> > The phandle to memory node indicates contiguous memory allocator(CMA).
> > If device node doesn't have cma, the registration of chunk heap fails.
>
> This reminds me of an ion heap developed at Arm several years ago:
> https://git.linaro.org/landing-teams/working/arm/kernel.git/tree/drivers/staging/android/ion/ion_compound_page.c
>
> Some more descriptive text here:
> https://github.com/ARM-software/CPA
>
> It maintains a pool of high-order pages with a worker thread to
> attempt compaction and allocation to keep the pool filled, with high
> and low watermarks to trigger freeing/allocating of chunks.
> It implements a shrinker to allow the system to reclaim the pool under
> high memory pressure.
>
> Is maintaining a pool something you considered? From the
> alloc_pages_bulk thread it sounds like you want to allocate 300M at a
> time, so I expect if you tuned the pool size to match that it could
> work quite well.
>
> That implementation isn't using a CMA region, but a similar approach
> could definitely be applied.
>

I have seriously considered CPA in our product but we developed our own
because of the pool in CPA.

The high-order pages are required by some specific users like the
Netflix app. Moreover, the required number of bytes is increasing
dramatically because of high-resolution videos and displays these days.

Gathering lots of free high-order pages in the background during
run-time means reserving that amount of pages from the entire available
system memory. Moreover, the gathered pages are soon reclaimed whenever
the system is suffering from memory pressure (e.g. camera recording,
heavy games).

So we had to consider allocating hundreds of megabytes at a time.

Of course, we don't allocate all buffers with a single call to
alloc_pages_bulk(), but a single buffer is still very large. A single
frame of 8K HDR video needs 95MB (7680*4320*2*1.5). Even a single frame
of 4K HDR video needs 24MB, and 4K HDR is now popular on Netflix,
YouTube and Google Play video.

> Thanks,
> -Brian

Thank you!
KyongHo
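The frame sizes quoted in the mail above can be checked with a short sketch. It follows the formula as written (width * height * 2 * 1.5). Reading the factor 2 as bytes per sample and 1.5 as 4:2:0 chroma overhead is an assumption, as is the 64 KiB chunk size taken from the cover letter's example. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes for one frame, following the formula in the mail:
 * width * height * 2 * 1.5. Treating 2 as bytes per sample and 1.5 as
 * 4:2:0 chroma overhead is an assumption, not something the mail states.
 */
static uint64_t frame_bytes(uint64_t width, uint64_t height,
			    uint64_t bytes_per_sample)
{
	assert(bytes_per_sample > 0);
	/* Multiply by 3, then divide by 2, to keep the 1.5 factor integral. */
	return width * height * bytes_per_sample * 3 / 2;
}

/* 64 KiB chunks needed to hold `bytes`, rounded up (chunk size assumed). */
static uint64_t chunks_for(uint64_t bytes)
{
	const uint64_t chunk = 64 * 1024;

	return (bytes + chunk - 1) / chunk;
}
```

With these helpers, frame_bytes(7680, 4320, 2) is 99,532,800 bytes (about 94.9 MiB, the "95MB" in the mail) and needs 1519 chunks of 64 KiB. frame_bytes(3840, 2160, 2) is 24,883,200 bytes (about 23.7 MiB) and needs 380 chunks.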
Hi KyongHo,

On Wed, Aug 19, 2020 at 12:46:26PM +0900, Cho KyongHo wrote:
> I have seriously considered CPA in our product but we developed our own
> because of the pool in CPA.

Oh good, I'm glad you considered it :-)

> The high-order pages are required by some specific users like the
> Netflix app. Moreover, the required number of bytes is increasing
> dramatically because of high-resolution videos and displays these days.
>
> Gathering lots of free high-order pages in the background during
> run-time means reserving that amount of pages from the entire available
> system memory. Moreover, the gathered pages are soon reclaimed whenever
> the system is suffering from memory pressure (e.g. camera recording,
> heavy games).

Aren't these two things in contradiction? If they're easily reclaimed
then they aren't "reserved" in any detrimental way. And if you don't
want them to be reclaimed, then you need them to be reserved...

The approach you have here assigns the chunk of memory as a reserved
CMA region which the kernel is going to try not to use too - similar
to the CPA pool.

I suppose it's a balance depending on how much you're willing to wait
for migration on the allocation path. CPA has the potential to get you
faster allocations, but the downside is you need to make it a little
more "greedy".

Cheers,
-Brian
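The trade-off Brian describes can be pictured with a small userspace model of a watermark-driven pool. This is only a sketch of the general idea (a pool with low and high watermarks, a fast path that drains the pool, and shrinker-style reclaim). It is not CPA's actual code or policy: the struct, the function names, and the choice to refill only up to the low watermark are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical model of a chunk pool; names are illustrative, not CPA's. */
struct chunk_pool {
	unsigned long nr_chunks;  /* chunks currently held in the pool */
	unsigned long low_mark;   /* refill when the pool drops below this */
	unsigned long high_mark;  /* trim when the pool grows above this */
};

/*
 * What a background worker would do next: a positive result is the number
 * of chunks to allocate, a negative result the number to free, 0 is idle.
 */
static long pool_rebalance(const struct chunk_pool *pool)
{
	assert(pool->low_mark <= pool->high_mark);

	if (pool->nr_chunks < pool->low_mark)
		return (long)(pool->low_mark - pool->nr_chunks);
	if (pool->nr_chunks > pool->high_mark)
		return -(long)(pool->nr_chunks - pool->high_mark);
	return 0;
}

/*
 * Serve a request from the pool. Returns how many chunks must still come
 * from the slow path (direct allocation, possibly with page migration).
 */
static unsigned long pool_take(struct chunk_pool *pool, unsigned long want)
{
	unsigned long fast = want < pool->nr_chunks ? want : pool->nr_chunks;

	pool->nr_chunks -= fast;
	return want - fast;
}

/* Shrinker-style reclaim: give back up to `nr` chunks, return how many. */
static unsigned long pool_shrink(struct chunk_pool *pool, unsigned long nr)
{
	unsigned long freed = nr < pool->nr_chunks ? nr : pool->nr_chunks;

	pool->nr_chunks -= freed;
	return freed;
}
```

In this model, a pool sized for 300M of 64KB chunks holds 4800 entries. A request served while the pool is full returns 0 from pool_take() and never touches the slow path. Once pool_shrink() has drained the pool under memory pressure, the same request falls back to direct allocation for most of its chunks. That is the balance between a "greedy" pool and waiting for migration on the allocation path.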
Hi Brian,

On Wed, Aug 19, 2020 at 02:22:04PM +0100, Brian Starkey wrote:
> Hi KyongHo,
>
> On Wed, Aug 19, 2020 at 12:46:26PM +0900, Cho KyongHo wrote:
> > I have seriously considered CPA in our product but we developed our own
> > because of the pool in CPA.
>
> Oh good, I'm glad you considered it :-)
>
> > The high-order pages are required by some specific users like the
> > Netflix app. Moreover, the required number of bytes is increasing
> > dramatically because of high-resolution videos and displays these days.
> >
> > Gathering lots of free high-order pages in the background during
> > run-time means reserving that amount of pages from the entire available
> > system memory. Moreover, the gathered pages are soon reclaimed whenever
> > the system is suffering from memory pressure (e.g. camera recording,
> > heavy games).
>
> Aren't these two things in contradiction? If they're easily reclaimed
> then they aren't "reserved" in any detrimental way. And if you don't
> want them to be reclaimed, then you need them to be reserved...
>
> The approach you have here assigns the chunk of memory as a reserved
> CMA region which the kernel is going to try not to use too - similar
> to the CPA pool.
>
> I suppose it's a balance depending on how much you're willing to wait
> for migration on the allocation path. CPA has the potential to get you
> faster allocations, but the downside is you need to make it a little
> more "greedy".
>

I understand why you see it as a contradiction, but I don't think it
is. The kernel page allocator now prefers free pages in CMA when
allocating movable pages, as of commit
https://lore.kernel.org/linux-mm/CAAmzW4P6+3O_RLvgy_QOKD4iXw+Hk3HE7Toc4Ky7kvQbCozCeA@mail.gmail.com/ .

We are trying to reduce unused pages to improve performance, so unused
pages in a pool should be easily reclaimed. That is why we do not
secure free pages in a special pool for a specific use case. Instead,
we have tried to reduce performance bottlenecks in page migration so
that a large amount of memory can be allocated when it is needed.