diff mbox

[08/10] mm: cma: Contiguous Memory Allocator added

Message ID 1307699698-29369-9-git-send-email-m.szyprowski@samsung.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Marek Szyprowski June 10, 2011, 9:54 a.m. UTC
The Contiguous Memory Allocator is a set of functions that lets
one initialise a region of memory which then can be used to perform
allocations of contiguous memory chunks from.

CMA allows for creation of separate contexts. Kernel is allowed to
allocate movable pages within CMA's managed memory so that it can be
used for page cache when CMA devices do not use it. On cm_alloc()
request such pages are migrated out of CMA area to free required
contiguous block.

This code is heavily based on earlier works by Michal Nazarewicz.

Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
---
 include/linux/cma.h |  189 +++++++++++++++++++++++++++
 mm/Kconfig          |   21 +++
 mm/Makefile         |    1 +
 mm/cma.c            |  358 +++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 569 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/cma.h
 create mode 100644 mm/cma.c

Comments

Arnd Bergmann June 10, 2011, 4:21 p.m. UTC | #1
On Friday 10 June 2011, Marek Szyprowski wrote:
>The Contiguous Memory Allocator is a set of functions that lets
>one initialise a region of memory which then can be used to perform
>allocations of contiguous memory chunks from.
>
>CMA allows for creation of separate contexts. Kernel is allowed to
>allocate movable pages within CMA's managed memory so that it can be
>used for page cache when CMA devices do not use it. On cm_alloc()
>request such pages are migrated out of CMA area to free required
>contiguous block.

Hi Marek,

I'm generally happy with the patches 1 through 7, i.e the heavy lifting
to make contiguous allocations work. Thank you very much for keeping
up the work and submitting these in a good shape.

I do think that we need to discuss the driver-visible API a bit more.
My feeling is that this is rather un-Linux-like and it needs to be
simplified some more. Of course, I don't mind being overruled by the
memory management experts here, or if you can argue that it's really
the right way to do it.

> + * Driver usage
> + *
> + *   CMA should not be used directly by the device drivers. It should
> + *   be considered as helper framework for dma-mapping subsystm and
> + *   respective (platform)bus drivers.
> + *
> + *   The CMA client needs to have a pointer to a CMA context
> + *   represented by a struct cma (which is an opaque data type).
> + *
> + *   Once such pointer is obtained, a caller may allocate contiguous
> + *   memory chunk using the following function:
> + *
> + *     cm_alloc()
> + *
> + *   This function returns a pointer to the first struct page which
> + *   represent a contiguous memory chunk.  This pointer
> + *   may be used with the following function:
> + *
> + *     cm_free()    -- frees allocated contiguous memory

Please explain why you want a new top-level API here. I think it
would be much nicer if a device driver could simply call 
alloc_pages(..., GFP_CMA) or similar, where all the complexity
in here gets hidden behind a conditional deep inside of the
page allocator.

Evidently, the two functions you describe here have an extra argument
for the context. Can you elaborate why that is really needed? What
is the specific requirement to have multiple such contexts in one
system and what is the worst-case effect that you would get when
the interface is simplified to have only one for all devices?

Alternatively, would it be possible to provide the cm_alloc/cm_free
functions only as backing to the dma mapping API and not export them
as a generic device driver interface?

> + * Platform/machine integration
> + *
> + *   CMA context must be created on platform or machine initialisation
> + *   and passed to respective subsystem that will be a client for CMA.
> + *   The latter may be done by a global variable or some filed in
> + *   struct device.  For the former CMA provides the following functions:
> + *
> + *     cma_init_migratetype()
> + *     cma_reserve()
> + *     cma_create()
> + *
> + *   The first one initialises a portion of reserved memory so that it
> + *   can be used with CMA.  The second first tries to reserve memory
> + *   (using memblock) and then initialise it.
> + *
> + *   The cma_reserve() function must be called when memblock is still
> + *   operational and reserving memory with it is still possible.  On
> + *   ARM platform the "reserve" machine callback is a perfect place to
> + *   call it.
> + *
> + *   The last function creates a CMA context on a range of previously
> + *   initialised memory addresses.  Because it uses kmalloc() it needs
> + *   to be called after SLAB is initialised.
> + */

This interface looks flawed to me for multiple reasons:

* It requires you to call three distinct functions in order to do one
  thing, and they all take the same arguments (more or less). Why not
  have one function call at the latest possible point where you can
  still change the memory attributes, and have everything else
  happen automatically?

* It requires you to pass the exact location of the area. I can see why
  you want that on certain machines that require DMA areas to be spread
  across multiple memory buses, but IMHO it's not appropriate for a
  generic API.

* It requires you to hardcode the size in a machine specific source file.
  This probably seems to be a natural thing to do when you have worked a
  lot on the ARM architecture, but it's really not. We really want to
  get rid of board files and replace them with generic probing based on
  the device tree, and the size of the area is more policy than a property
  of the hardware that can be accurately described in the device tree or
  a board file.

I'm sure that we can find a solution for all of these problems, it just
takes a few good ideas. Especially for the last point, I don't have a
better suggestion yet, but hopefully someone else can come up with one.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marek Szyprowski June 13, 2011, 9:05 a.m. UTC | #2
Hello,

On Friday, June 10, 2011 6:22 PM Arnd Bergmann wrote:

> On Friday 10 June 2011, Marek Szyprowski wrote:
> >The Contiguous Memory Allocator is a set of functions that lets
> >one initialise a region of memory which then can be used to perform
> >allocations of contiguous memory chunks from.
> >
> >CMA allows for creation of separate contexts. Kernel is allowed to
> >allocate movable pages within CMA's managed memory so that it can be
> >used for page cache when CMA devices do not use it. On cm_alloc()
> >request such pages are migrated out of CMA area to free required
> >contiguous block.
> 
> Hi Marek,
> 
> I'm generally happy with the patches 1 through 7, i.e the heavy lifting
> to make contiguous allocations work. Thank you very much for keeping
> up the work and submitting these in a good shape.
> 
> I do think that we need to discuss the driver-visible API a bit more.
> My feeling is that this is rather un-Linux-like and it needs to be
> simplified some more. Of course, I don't mind being overruled by the
> memory management experts here, or if you can argue that it's really
> the right way to do it.

Thanks for your comments!
 
> > + * Driver usage
> > + *
> > + *   CMA should not be used directly by the device drivers. It should
> > + *   be considered as helper framework for dma-mapping subsystm and
> > + *   respective (platform)bus drivers.
> > + *
> > + *   The CMA client needs to have a pointer to a CMA context
> > + *   represented by a struct cma (which is an opaque data type).
> > + *
> > + *   Once such pointer is obtained, a caller may allocate contiguous
> > + *   memory chunk using the following function:
> > + *
> > + *     cm_alloc()
> > + *
> > + *   This function returns a pointer to the first struct page which
> > + *   represent a contiguous memory chunk.  This pointer
> > + *   may be used with the following function:
> > + *
> > + *     cm_free()    -- frees allocated contiguous memory
> 
> Please explain why you want a new top-level API here. I think it
> would be much nicer if a device driver could simply call
> alloc_pages(..., GFP_CMA) or similar, where all the complexity
> in here gets hidden behind a conditional deep inside of the
> page allocator.
> 
> Evidently, the two functions you describe here have an extra argument
> for the context. Can you elaborate why that is really needed? What
> is the specific requirement to have multiple such contexts in one
> system and what is the worst-case effect that you would get when
> the interface is simplified to have only one for all devices?
> 
> Alternatively, would it be possible to provide the cm_alloc/cm_free
> functions only as backing to the dma mapping API and not export them
> as a generic device driver interface?

cm_alloc/free are definitely not meant to be called from device drivers.
They should be only considered as a backend for dma-mapping.

'Raw' contiguous memory block doesn't really make sense for the device
drivers. What the drivers require is a contiguous memory block that is
somehow mapped into respective bus address space, so dma-mapping 
framework is the right interface.

alloc_pages(..., GFP_CMA) looks nice but in fact it is really impractical.
The driver will need to map such buffer to dma context anyway, so imho
dma_alloc_attributed() will give the drivers much more flexibility. In
terms of dma-mapping the context argument isn't anything odd. 

If possible I would like to make cma something similar to 
declare_dma_coherent()&friends, so the board/platform/bus startup code
will just call declare_dma_contiguous() to enable support for cma for
particular devices.

> > + * Platform/machine integration
> > + *
> > + *   CMA context must be created on platform or machine initialisation
> > + *   and passed to respective subsystem that will be a client for CMA.
> > + *   The latter may be done by a global variable or some filed in
> > + *   struct device.  For the former CMA provides the following
> functions:
> > + *
> > + *     cma_init_migratetype()
> > + *     cma_reserve()
> > + *     cma_create()
> > + *
> > + *   The first one initialises a portion of reserved memory so that it
> > + *   can be used with CMA.  The second first tries to reserve memory
> > + *   (using memblock) and then initialise it.
> > + *
> > + *   The cma_reserve() function must be called when memblock is still
> > + *   operational and reserving memory with it is still possible.  On
> > + *   ARM platform the "reserve" machine callback is a perfect place to
> > + *   call it.
> > + *
> > + *   The last function creates a CMA context on a range of previously
> > + *   initialised memory addresses.  Because it uses kmalloc() it needs
> > + *   to be called after SLAB is initialised.
> > + */
> 
> This interface looks flawed to me for multiple reasons:
> 
> * It requires you to call three distinct functions in order to do one
>   thing, and they all take the same arguments (more or less). Why not
>   have one function call at the latest possible point where you can
>   still change the memory attributes, and have everything else
>   happen automatically?

Initialization part will be definitely simplified, I must confess that I
was in hurry to post the patches before the weekend and just forgot to
cleanup this part...
 
> * It requires you to pass the exact location of the area. I can see why
>   you want that on certain machines that require DMA areas to be spread
>   across multiple memory buses, but IMHO it's not appropriate for a
>   generic API.

IMHO we can also use some NULL context to indicate some global, system 
wide CMA area and again -> in terms of dma-mapping api having a context 
isn't anything uncommon.
 
> * It requires you to hardcode the size in a machine specific source file.
>   This probably seems to be a natural thing to do when you have worked a
>   lot on the ARM architecture, but it's really not. We really want to
>   get rid of board files and replace them with generic probing based on
>   the device tree, and the size of the area is more policy than a property
>   of the hardware that can be accurately described in the device tree or
>   a board file.

The problem is the fact that right now, we still have board files and we
have to live with them for a while (with all advantages and disadvantages).
I hope that you won't require me to rewrite the whole support for all ARM 
platforms to get rid of board files to get CMA merged ;)

I see no problem defining CMA areas in device tree, as this is something
really specific to particular board configuration. 

> I'm sure that we can find a solution for all of these problems, it just
> takes a few good ideas. Especially for the last point, I don't have a
> better suggestion yet, but hopefully someone else can come up with one.

I hope we will manage to get agreement :)

Best regards
Arnd Bergmann June 14, 2011, 1:49 p.m. UTC | #3
On Monday 13 June 2011, Marek Szyprowski wrote:
> cm_alloc/free are definitely not meant to be called from device drivers.
> They should be only considered as a backend for dma-mapping.
>
> 'Raw' contiguous memory block doesn't really make sense for the device
> drivers. What the drivers require is a contiguous memory block that is
> somehow mapped into respective bus address space, so dma-mapping 
> framework is the right interface.
> 
> alloc_pages(..., GFP_CMA) looks nice but in fact it is really impractical.
> The driver will need to map such buffer to dma context anyway, so imho
> dma_alloc_attributed() will give the drivers much more flexibility. In
> terms of dma-mapping the context argument isn't anything odd. 

Ok.

> If possible I would like to make cma something similar to 
> declare_dma_coherent()&friends, so the board/platform/bus startup code
> will just call declare_dma_contiguous() to enable support for cma for
> particular devices.

Sounds good, I like that.

> > This interface looks flawed to me for multiple reasons:
> > 
> > * It requires you to call three distinct functions in order to do one
> >   thing, and they all take the same arguments (more or less). Why not
> >   have one function call at the latest possible point where you can
> >   still change the memory attributes, and have everything else
> >   happen automatically?
> 
> Initialization part will be definitely simplified, I must confess that I
> was in hurry to post the patches before the weekend and just forgot to
> cleanup this part...

Fair enough. In cases like this, it's often good to add a TODO section
to the patch description, or a FIXME comment in the patch itself.

> > * It requires you to pass the exact location of the area. I can see why
> >   you want that on certain machines that require DMA areas to be spread
> >   across multiple memory buses, but IMHO it's not appropriate for a
> >   generic API.
> 
> IMHO we can also use some NULL context to indicate some global, system 
> wide CMA area and again -> in terms of dma-mapping api having a context 
> isn't anything uncommon.

Please explain the exact requirements that lead you to defining multiple
contexts. My feeling is that in most cases we don't need them and can
simply live with a single area. Depending on how obscure the cases are
where we do need something beyond that, we can then come up with
special-case solutions for them.

> > * It requires you to hardcode the size in a machine specific source file.
> >   This probably seems to be a natural thing to do when you have worked a
> >   lot on the ARM architecture, but it's really not. We really want to
> >   get rid of board files and replace them with generic probing based on
> >   the device tree, and the size of the area is more policy than a property
> >   of the hardware that can be accurately described in the device tree or
> >   a board file.
> 
> The problem is the fact that right now, we still have board files and we
> have to live with them for a while (with all advantages and disadvantages).
> I hope that you won't require me to rewrite the whole support for all ARM 
> platforms to get rid of board files to get CMA merged ;)

Of course not. But we need to know what we want a platform with device
tree support to look like when it's using CMA, so we don't make it
harder to change the platforms over than it already is.

> I see no problem defining CMA areas in device tree, as this is something
> really specific to particular board configuration. 

The problem with defining CMA areas in the device tree is that it's not
a property of the hardware, but really policy. The device tree file
should not contain anything related to a specific operating system
because you might want to boot something on the board that doesn't
know about CMA, and even when you only care about using Linux, the
implementation might change to the point where hardcoded CMA areas
no longer make sense.

IMHO we should first aim for a solution that works almost everywhere
without the kernel knowing what board it's running on, and then we
can add quirks for devices that have special requirements. I think
the situation is similar to the vmalloc virtual address space, which
normally has a hardcoded size that works almost everywhere, but there
are certain drivers etc that require much more, or there are situations
where you want to make it smaller in order to avoid highmem. 

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Nazarewicz June 14, 2011, 1:55 p.m. UTC | #4
On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <arnd@arndb.de> wrote:
> Please explain the exact requirements that lead you to defining multiple
> contexts.

Some devices may have access only to some banks of memory.  Some devices
may use different banks of memory for different purposes.
Arnd Bergmann June 14, 2011, 4:03 p.m. UTC | #5
On Tuesday 14 June 2011, Michal Nazarewicz wrote:
> On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <arnd@arndb.de> wrote:
> > Please explain the exact requirements that lead you to defining multiple
> > contexts.
> 
> Some devices may have access only to some banks of memory.  Some devices
> may use different banks of memory for different purposes.

For all I know, that is something that is only true for a few very special
Samsung devices, and is completely unrelated of the need for contiguous
allocations, so this approach becomes pointless as soon as the next
generation of that chip grows an IOMMU, where we don't handle the special
bank attributes. Also, the way I understood the situation for the Samsung
SoC during the Budapest discussion, it's only a performance hack, not a
functional requirement, unless you count '1080p playback' as a functional
requirement.

Supporting contiguous allocation is a very useful goal and many people want
this, but supporting a crazy one-off hardware design with lots of generic
infrastructure is going a bit too far. If you can't be more specific than
'some devices may need this', I would suggest going forward without having
multiple regions:

* Remove the registration of specific addresses from the initial patch
  set (but keep the patch).
* Add a heuristic plus command-line override to automatically come up
  with a reasonable location+size for *one* CMA area in the system.
* Ship the patch to add support for multiple CMA areas with the BSP
  for the boards that need it (if any).
* Wait for someone on a non-Samsung SoC to run into the same problem,
  then have /them/ get the final patch in.

Even if you think you can convince enough people that having support
for distinct predefined regions is a good idea, I would recommend
splitting that out of the initial merge so we can have that discussion
separately from the other issues.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Nazarewicz June 14, 2011, 4:58 p.m. UTC | #6
>> On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <arnd@arndb.de> wrote:
>>> Please explain the exact requirements that lead you to defining  
>>> multiple contexts.

> On Tuesday 14 June 2011, Michal Nazarewicz wrote:
>> Some devices may have access only to some banks of memory.  Some devices
>> may use different banks of memory for different purposes.

On Tue, 14 Jun 2011 18:03:00 +0200, Arnd Bergmann wrote:
> For all I know, that is something that is only true for a few very  
> special Samsung devices,

Maybe.  I'm just answering your question. :)

Ah yes, I forgot that separate regions for different purposes could
decrease fragmentation.

> I would suggest going forward without having multiple regions:

Is having support for multiple regions a bad thing?  Frankly,
removing this support will change code from reading context passed
as argument to code reading context from global variable.  Nothing
is gained; functionality is lost.

> * Remove the registration of specific addresses from the initial patch
>   set (but keep the patch).
> * Add a heuristic plus command-line override to automatically come up
>   with a reasonable location+size for *one* CMA area in the system.

I'm not arguing those.
Daniel Stone June 14, 2011, 5:01 p.m. UTC | #7
Hi,

On Tue, Jun 14, 2011 at 06:03:00PM +0200, Arnd Bergmann wrote:
> On Tuesday 14 June 2011, Michal Nazarewicz wrote:
> > On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <arnd@arndb.de> wrote:
> > > Please explain the exact requirements that lead you to defining multiple
> > > contexts.
> > 
> > Some devices may have access only to some banks of memory.  Some devices
> > may use different banks of memory for different purposes.
> 
> For all I know, that is something that is only true for a few very special
> Samsung devices, and is completely unrelated of the need for contiguous
> allocations, so this approach becomes pointless as soon as the next
> generation of that chip grows an IOMMU, where we don't handle the special
> bank attributes. Also, the way I understood the situation for the Samsung
> SoC during the Budapest discussion, it's only a performance hack, not a
> functional requirement, unless you count '1080p playback' as a functional
> requirement.

Hm, I think that was something similar but not quite the same: talking
about having allocations split to lie between two banks of RAM to
maximise the read/write speed for performance reasons.  That's something
that can be handled in the allocator, rather than an API constraint, as
this is.

Not that I know of any hardware which is limited as such, but eh.

Cheers,
Daniel
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann June 14, 2011, 6:30 p.m. UTC | #8
On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
> On Tue, 14 Jun 2011 18:03:00 +0200, Arnd Bergmann wrote:
> > For all I know, that is something that is only true for a few very  
> > special Samsung devices,
> 
> Maybe.  I'm just answering your question. :)
> 
> Ah yes, I forgot that separate regions for different purposes could
> decrease fragmentation.

That is indeed a good point, but having a good allocator algorithm
could also solve this. I don't know too much about these allocation
algorithms, but there are probably multiple working approaches to this.

> > I would suggest going forward without having multiple regions:
> 
> Is having support for multiple regions a bad thing?  Frankly,
> removing this support will change code from reading context passed
> as argument to code reading context from global variable.  Nothing
> is gained; functionality is lost.

What is bad IMHO is making them the default, which forces the board
code to care about memory management details. I would much prefer
to have contiguous allocation parameters tuned automatically to just
work on most boards before we add ways to do board-specific hacks.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Nazarewicz June 14, 2011, 6:40 p.m. UTC | #9
> On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
>> Is having support for multiple regions a bad thing?  Frankly,
>> removing this support will change code from reading context passed
>> as argument to code reading context from global variable.  Nothing
>> is gained; functionality is lost.

On Tue, 14 Jun 2011 20:30:07 +0200, Arnd Bergmann wrote:
> What is bad IMHO is making them the default, which forces the board
> code to care about memory management details. I would much prefer
> to have contiguous allocation parameters tuned automatically to just
> work on most boards before we add ways to do board-specific hacks.

I see those as orthogonal problems.  The code can have support for
multiple contexts but by default use a single global context exported
as cma_global variable (or some such).

And I'm not arguing against having “contiguous allocation parameters
tuned automatically to just work on most boards”.  I just don't see
the reason to delete functionality that is already there, does not
add much code and can be useful.
Zach Pfeffer June 14, 2011, 6:58 p.m. UTC | #10
On 14 June 2011 12:01, Daniel Stone <daniels@collabora.com> wrote:
> Hi,
>
> On Tue, Jun 14, 2011 at 06:03:00PM +0200, Arnd Bergmann wrote:
>> On Tuesday 14 June 2011, Michal Nazarewicz wrote:
>> > On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann <arnd@arndb.de> wrote:
>> > > Please explain the exact requirements that lead you to defining multiple
>> > > contexts.
>> >
>> > Some devices may have access only to some banks of memory.  Some devices
>> > may use different banks of memory for different purposes.
>>
>> For all I know, that is something that is only true for a few very special
>> Samsung devices, and is completely unrelated of the need for contiguous
>> allocations, so this approach becomes pointless as soon as the next
>> generation of that chip grows an IOMMU, where we don't handle the special
>> bank attributes. Also, the way I understood the situation for the Samsung
>> SoC during the Budapest discussion, it's only a performance hack, not a
>> functional requirement, unless you count '1080p playback' as a functional
>> requirement.

Coming in mid topic...

I've seen this split bank allocation in Qualcomm and TI SoCs, with
Samsung, that makes 3 major SoC vendors (I would be surprised if
Nvidia didn't also need to do this) - so I think some configurable
method to control allocations is necessarily. The chips can't do
decode without it (and by can't do I mean 1080P and higher decode is
not functionally useful). Far from special, this would appear to be
the default.

> Hm, I think that was something similar but not quite the same: talking
> about having allocations split to lie between two banks of RAM to
> maximise the read/write speed for performance reasons.  That's something
> that can be handled in the allocator, rather than an API constraint, as
> this is.
>
> Not that I know of any hardware which is limited as such, but eh.
>
> Cheers,
> Daniel
>
> _______________________________________________
> Linaro-mm-sig mailing list
> Linaro-mm-sig@lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-mm-sig
>
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann June 14, 2011, 8:42 p.m. UTC | #11
On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> I've seen this split bank allocation in Qualcomm and TI SoCs, with
> Samsung, that makes 3 major SoC vendors (I would be surprised if
> Nvidia didn't also need to do this) - so I think some configurable
> method to control allocations is necessarily. The chips can't do
> decode without it (and by can't do I mean 1080P and higher decode is
> not functionally useful). Far from special, this would appear to be
> the default.

Thanks for the insight, that's a much better argument than 'something
may need it'. Are those all chips without an IOMMU or do we also
need to solve the IOMMU case with split bank allocation?

I think I'd still prefer to see the support for multiple regions split
out into one of the later patches, especially since that would defer
the question of how to do the initialization for this case and make
sure we first get a generic way.

You've convinced me that we need to solve the problem of allocating
memory from a specific bank eventually, but separating it from the
one at hand (contiguous allocation) should help getting the important
groundwork in at first.

The possible conflict that I still see with per-bank CMA regions are:

* It completely destroys memory power management in cases where that
  is based on powering down entire memory banks.

* We still need to solve the same problem in case of IOMMU mappings
  at some point, even if today's hardware doesn't have this combination.
  It would be good to use the same solution for both.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jordan Crouse June 14, 2011, 9:01 p.m. UTC | #12
On 06/14/2011 02:42 PM, Arnd Bergmann wrote:
> On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
>> I've seen this split bank allocation in Qualcomm and TI SoCs, with
>> Samsung, that makes 3 major SoC vendors (I would be surprised if
>> Nvidia didn't also need to do this) - so I think some configurable
>> method to control allocations is necessarily. The chips can't do
>> decode without it (and by can't do I mean 1080P and higher decode is
>> not functionally useful). Far from special, this would appear to be
>> the default.
>
> Thanks for the insight, that's a much better argument than 'something
> may need it'. Are those all chips without an IOMMU or do we also
> need to solve the IOMMU case with split bank allocation?

Yes. The IOMMU case with split bank allocation is key, especially for shared
buffers. Consider the case where video is using a certain bank for performance
purposes and that frame is shared with the GPU.

Jordan
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Subash Patel June 15, 2011, 6:01 a.m. UTC | #13
Hi Arnd,

On 06/14/2011 09:33 PM, Arnd Bergmann wrote:
> On Tuesday 14 June 2011, Michal Nazarewicz wrote:
>> On Tue, 14 Jun 2011 15:49:29 +0200, Arnd Bergmann<arnd@arndb.de>  wrote:
>>> Please explain the exact requirements that lead you to defining multiple
>>> contexts.
>>
>> Some devices may have access only to some banks of memory.  Some devices
>> may use different banks of memory for different purposes.
>
> For all I know, that is something that is only true for a few very special
> Samsung devices, and is completely unrelated of the need for contiguous
> allocations, so this approach becomes pointless as soon as the next
> generation of that chip grows an IOMMU, where we don't handle the special
> bank attributes. Also, the way I understood the situation for the Samsung
> SoC during the Budapest discussion, it's only a performance hack, not a
> functional requirement, unless you count '1080p playback' as a functional
> requirement.
>
1080p@30fps is indeed a functional requirement, as the IP has the 
capability to achieve it. This IP itself can act as more than one AXI 
master, and control more than one memory port(bank). So this is not a 
*performance hack*

Also, if I recall, during the Budapest discussion (I was on irc), I 
recall that this requirement can become the information available to the 
actual allocator. Below is the summary point I could collect from summit 
notes:

     * May also need to specify more attributes (specific physical 
memory region)

As per this point, the requirement (as above) must be attribute to the 
allocator, which is CMA in this case.

> Supporting contiguous allocation is a very useful goal and many people want
> this, but supporting a crazy one-off hardware design with lots of generic
> infrastructure is going a bit too far. If you can't be more specific than
> 'some devices may need this', I would suggest going forward without having
> multiple regions:
>
> * Remove the registration of specific addresses from the initial patch
>    set (but keep the patch).
> * Add a heuristic plus command-line override to automatically come up
>    with a reasonable location+size for *one* CMA area in the system.
> * Ship the patch to add support for multiple CMA areas with the BSP
>    for the boards that need it (if any).
> * Wait for someone on a non-Samsung SoC to run into the same problem,
>    then have /them/ get the final patch in.
>
> Even if you think you can convince enough people that having support
> for distinct predefined regions is a good idea, I would recommend
> splitting that out of the initial merge so we can have that discussion
> separately from the other issues.
>
> 	Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Regards,
Subash
SISO-SLG
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marek Szyprowski June 15, 2011, 7:11 a.m. UTC | #14
Hello,

On Tuesday, June 14, 2011 8:30 PM Arnd Bergmann wrote:

> On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
> > On Tue, 14 Jun 2011 18:03:00 +0200, Arnd Bergmann wrote:
> > > For all I know, that is something that is only true for a few very
> > > special Samsung devices,
> >
> > Maybe.  I'm just answering your question. :)
> >
> > Ah yes, I forgot that separate regions for different purposes could
> > decrease fragmentation.
> 
> That is indeed a good point, but having a good allocator algorithm
> could also solve this. I don't know too much about these allocation
> algorithms, but there are probably multiple working approaches to this.
> 
> > > I would suggest going forward without having multiple regions:
> >
> > Is having support for multiple regions a bad thing?  Frankly,
> > removing this support will change code from reading context passed
> > as argument to code reading context from global variable.  Nothing
> > is gained; functionality is lost.
> 
> What is bad IMHO is making them the default, which forces the board
> code to care about memory management details. I would much prefer
> to have contiguous allocation parameters tuned automatically to just
> work on most boards before we add ways to do board-specific hacks.

I see your concerns, but I really wonder how to determine the properties
of the global/default cma pool. You definitely don't want to give all
available memory o CMA, because it will have negative impact on kernel
operation (kernel really needs to allocate unmovable pages from time to
time). 

The only solution I see now is to provide Kconfig entry to determine
the size of the global CMA pool, but this still have some issues,
especially for multi-board kernels (each board probably will have
different amount of RAM and different memory-consuming devices
available). It looks that each board startup code still might need to
tweak the size of CMA pool. I can add a kernel command line option for
it, but such solution also will not solve all the cases (afair there
was a discussion about kernel command line parameters for memory 
configuration and the conclusion was that it should be avoided).

Best regards
Arnd Bergmann June 15, 2011, 7:37 a.m. UTC | #15
On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
> I see your concerns, but I really wonder how to determine the properties
> of the global/default cma pool. You definitely don't want to give all
> available memory o CMA, because it will have negative impact on kernel
> operation (kernel really needs to allocate unmovable pages from time to
> time). 

Exactly. This is a hard problem, so I would prefer to see a solution for
coming up with reasonable defaults.

> The only solution I see now is to provide Kconfig entry to determine
> the size of the global CMA pool, but this still have some issues,
> especially for multi-board kernels (each board probably will have
> different amount of RAM and different memory-consuming devices
> available). It looks that each board startup code still might need to
> tweak the size of CMA pool. I can add a kernel command line option for
> it, but such solution also will not solve all the cases (afair there
> was a discussion about kernel command line parameters for memory 
> configuration and the conclusion was that it should be avoided).

The command line option can be a last resort if the heuristics fail,
but it's not much better than a fixed Kconfig setting.

How about a Kconfig option that defines the percentage of memory
to set aside for contiguous allocations?

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marek Szyprowski June 15, 2011, 8:02 a.m. UTC | #16
Hello,

On Tuesday, June 14, 2011 3:49 PM Arnd Bergmann wrote:

> On Monday 13 June 2011, Marek Szyprowski wrote:
> > cm_alloc/free are definitely not meant to be called from device drivers.
> > They should be only considered as a backend for dma-mapping.
> >
> > 'Raw' contiguous memory block doesn't really make sense for the device
> > drivers. What the drivers require is a contiguous memory block that is
> > somehow mapped into respective bus address space, so dma-mapping
> > framework is the right interface.
> >
> > alloc_pages(..., GFP_CMA) looks nice but in fact it is really impractical.
> > The driver will need to map such buffer to dma context anyway, so imho
> > dma_alloc_attributed() will give the drivers much more flexibility. In
> > terms of dma-mapping the context argument isn't anything odd.
> 
> Ok.
> 
> > If possible I would like to make cma something similar to
> > declare_dma_coherent()&friends, so the board/platform/bus startup code
> > will just call declare_dma_contiguous() to enable support for cma for
> > particular devices.
> 
> Sounds good, I like that.
 
Thanks. I thought a bit more on this and decided that I want to make this
declare_dma_contiguous() optional for the drivers. It should be used only
for some sophisticated cases like for example our video codec with two
memory interfaces for 2 separate banks. By default the dma-mapping will
use system-wide cma pool.

(snipped)

> > > * It requires you to pass the exact location of the area. I can see why
> > >   you want that on certain machines that require DMA areas to be spread
> > >   across multiple memory buses, but IMHO it's not appropriate for a
> > >   generic API.
> >
> > IMHO we can also use some NULL context to indicate some global, system
> > wide CMA area and again -> in terms of dma-mapping api having a context
> > isn't anything uncommon.
> 
> Please explain the exact requirements that lead you to defining multiple
> contexts. My feeling is that in most cases we don't need them and can
> simply live with a single area. Depending on how obscure the cases are
> where we do need something beyond that, we can then come up with
> special-case solutions for them.

Like it was already stated we need such feature for our multimedia codec
to allocate buffers from different memory banks. I really don't see any
problems with the possibility to have additional cma areas for special
purposes.

> > > * It requires you to hardcode the size in a machine specific source
> file.
> > >   This probably seems to be a natural thing to do when you have worked
> a
> > >   lot on the ARM architecture, but it's really not. We really want to
> > >   get rid of board files and replace them with generic probing based on
> > >   the device tree, and the size of the area is more policy than a
> property
> > >   of the hardware that can be accurately described in the device tree
> or
> > >   a board file.
> >
> > The problem is the fact that right now, we still have board files and we
> > have to live with them for a while (with all advantages and
> disadvantages).
> > I hope that you won't require me to rewrite the whole support for all ARM
> > platforms to get rid of board files to get CMA merged ;)
> 
> Of course not. But we need to know what we want a platform with device
> tree support to look like when it's using CMA, so we don't make it
> harder to change the platforms over than it already is.
> 
> > I see no problem defining CMA areas in device tree, as this is something
> > really specific to particular board configuration.
> 
> The problem with defining CMA areas in the device tree is that it's not
> a property of the hardware, but really policy. The device tree file
> should not contain anything related to a specific operating system
> because you might want to boot something on the board that doesn't
> know about CMA, and even when you only care about using Linux, the
> implementation might change to the point where hardcoded CMA areas
> no longer make sense.

I really doubt that the device tree will carry only system-independent
information. Anyway, the preferred or required memory areas/banks for
buffer allocation is something that is a property of the hardware not
the OS policy.

> IMHO we should first aim for a solution that works almost everywhere
> without the kernel knowing what board it's running on, and then we
> can add quirks for devices that have special requirements. I think
> the situation is similar to the vmalloc virtual address space, which
> normally has a hardcoded size that works almost everywhere, but there
> are certain drivers etc that require much more, or there are situations
> where you want to make it smaller in order to avoid highmem.

I'm trying to create something that will fulfill the requirements of my
hardware, that's why I cannot focus on a generic case only.

Best regards
Marek Szyprowski June 15, 2011, 8:14 a.m. UTC | #17
Hello,

On Wednesday, June 15, 2011 9:37 AM Arnd Bergmann wrote:

> On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
> > I see your concerns, but I really wonder how to determine the properties
> > of the global/default cma pool. You definitely don't want to give all
> > available memory o CMA, because it will have negative impact on kernel
> > operation (kernel really needs to allocate unmovable pages from time to
> > time).
> 
> Exactly. This is a hard problem, so I would prefer to see a solution for
> coming up with reasonable defaults.

The problem is to define these reasonable defaults, because they also depend
on the target usage pattern for the board. If one doesn't plan to use video
codec at all, then the value calculated for full HD movie decoding are 
definitely too high.

> > The only solution I see now is to provide Kconfig entry to determine
> > the size of the global CMA pool, but this still have some issues,
> > especially for multi-board kernels (each board probably will have
> > different amount of RAM and different memory-consuming devices
> > available). It looks that each board startup code still might need to
> > tweak the size of CMA pool. I can add a kernel command line option for
> > it, but such solution also will not solve all the cases (afair there
> > was a discussion about kernel command line parameters for memory
> > configuration and the conclusion was that it should be avoided).
> 
> The command line option can be a last resort if the heuristics fail,
> but it's not much better than a fixed Kconfig setting.
> 
> How about a Kconfig option that defines the percentage of memory
> to set aside for contiguous allocations?

There can be probably both types of Kconfig entries: for absolute value
and the percentage of the total memory, but still, creating a 
fully-functional multi-board kernel will be really hard.

However there is one more issue here. It is quite common that embedded
systems have memory that is not really contiguous in address space
(there are some holes that splits it into 'banks' or regions). So we
might have a system with 256MiB of memory split into 2 memory banks.
In this case the CMA pool can use only one of them (the pool must be
contiguous internally). I'm not sure if such systems might require more
memory for contiguous buffers.

Best regards
Marek Szyprowski June 15, 2011, 8:36 a.m. UTC | #18
Hello,

On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:

> On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> > I've seen this split bank allocation in Qualcomm and TI SoCs, with
> > Samsung, that makes 3 major SoC vendors (I would be surprised if
> > Nvidia didn't also need to do this) - so I think some configurable
> > method to control allocations is necessarily. The chips can't do
> > decode without it (and by can't do I mean 1080P and higher decode is
> > not functionally useful). Far from special, this would appear to be
> > the default.
> 
> Thanks for the insight, that's a much better argument than 'something
> may need it'. Are those all chips without an IOMMU or do we also
> need to solve the IOMMU case with split bank allocation?
> 
> I think I'd still prefer to see the support for multiple regions split
> out into one of the later patches, especially since that would defer
> the question of how to do the initialization for this case and make
> sure we first get a generic way.
> 
> You've convinced me that we need to solve the problem of allocating
> memory from a specific bank eventually, but separating it from the
> one at hand (contiguous allocation) should help getting the important
> groundwork in at first.
>
> The possible conflict that I still see with per-bank CMA regions are:
> 
> * It completely destroys memory power management in cases where that
>   is based on powering down entire memory banks.

I don't think that per-bank CMA regions destroys memory power management
more than the global CMA pool. Please note that the contiguous buffers
(or in general dma-buffers) right now are unmovable so they don't fit
well into memory power management.

Best regards
Michał Nazarewicz June 15, 2011, 9:26 a.m. UTC | #19
On Tue, 14 Jun 2011 22:42:24 +0200, Arnd Bergmann <arnd@arndb.de> wrote:
> * We still need to solve the same problem in case of IOMMU mappings
>   at some point, even if today's hardware doesn't have this combination.
>   It would be good to use the same solution for both.

I don't think I follow.  What does IOMMU has to do with CMA?
Arnd Bergmann June 15, 2011, 11:14 a.m. UTC | #20
On Wednesday 15 June 2011, Marek Szyprowski wrote:

> > > If possible I would like to make cma something similar to
> > > declare_dma_coherent()&friends, so the board/platform/bus startup code
> > > will just call declare_dma_contiguous() to enable support for cma for
> > > particular devices.
> > 
> > Sounds good, I like that.
>  
> Thanks. I thought a bit more on this and decided that I want to make this
> declare_dma_contiguous() optional for the drivers. It should be used only
> for some sophisticated cases like for example our video codec with two
> memory interfaces for 2 separate banks. By default the dma-mapping will
> use system-wide cma pool.
> 
> (snipped)

Ok, good.

> > Please explain the exact requirements that lead you to defining multiple
> > contexts. My feeling is that in most cases we don't need them and can
> > simply live with a single area. Depending on how obscure the cases are
> > where we do need something beyond that, we can then come up with
> > special-case solutions for them.
> 
> Like it was already stated we need such feature for our multimedia codec
> to allocate buffers from different memory banks. I really don't see any
> problems with the possibility to have additional cma areas for special
> purposes.

It's not a problem for special cases, I just want to make sure that
the common case works well enough that we don't need too many special
cases.

> > The problem with defining CMA areas in the device tree is that it's not
> > a property of the hardware, but really policy. The device tree file
> > should not contain anything related to a specific operating system
> > because you might want to boot something on the board that doesn't
> > know about CMA, and even when you only care about using Linux, the
> > implementation might change to the point where hardcoded CMA areas
> > no longer make sense.
> 
> I really doubt that the device tree will carry only system-independent
> information.

So far, this has worked well enough.

> Anyway, the preferred or required memory areas/banks for
> buffer allocation is something that is a property of the hardware not
> the OS policy.

That is true. If there are hard requirements of the hardware, we
can and should definitely document them in the device tree. It's
totally fine to describe the layout of memory banks and affinity
of devices to specific banks where that exists in hardware.

The part that should not be in the device tree is a specific location
of a buffer inside the bank when there is no hardware reason for
the location.

I guess it's also fine to describe requirements per device regarding
how much contiguous memory it will need to operate, if you can provide
that number independent of the application you want to run. The early
boot code can walk the device tree and ensure that the region is
large enough.

> > IMHO we should first aim for a solution that works almost everywhere
> > without the kernel knowing what board it's running on, and then we
> > can add quirks for devices that have special requirements. I think
> > the situation is similar to the vmalloc virtual address space, which
> > normally has a hardcoded size that works almost everywhere, but there
> > are certain drivers etc that require much more, or there are situations
> > where you want to make it smaller in order to avoid highmem.
> 
> I'm trying to create something that will fulfill the requirements of my
> hardware, that's why I cannot focus on a generic case only.

Yes, that is the obvious conflict. My main interest is to have something
that works not just for your hardware but works as good as possible on
the widest possible range of hardware, which may mean it may run not quite
as good in some of the cases.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann June 15, 2011, 11:20 a.m. UTC | #21
On Wednesday 15 June 2011, Michal Nazarewicz wrote:
> On Tue, 14 Jun 2011 22:42:24 +0200, Arnd Bergmann <arnd@arndb.de> wrote:
> > * We still need to solve the same problem in case of IOMMU mappings
> >   at some point, even if today's hardware doesn't have this combination.
> >   It would be good to use the same solution for both.
> 
> I don't think I follow.  What does IOMMU has to do with CMA?

The point is that on the higher level device drivers, we want to
hide the presence of CMA and/or IOMMU behind the dma mapping API,
but the device drivers do need to know about the bank properties.

If we want to solve the problem of allocating per-bank memory inside
of CMA, we also need to solve it inside of the IOMMU code, using
the same device driver interface.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann June 15, 2011, 11:27 a.m. UTC | #22
On Tuesday 14 June 2011, Jordan Crouse wrote:
> 
> On 06/14/2011 02:42 PM, Arnd Bergmann wrote:
> > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> >> I've seen this split bank allocation in Qualcomm and TI SoCs, with
> >> Samsung, that makes 3 major SoC vendors (I would be surprised if
> >> Nvidia didn't also need to do this) - so I think some configurable
> >> method to control allocations is necessarily. The chips can't do
> >> decode without it (and by can't do I mean 1080P and higher decode is
> >> not functionally useful). Far from special, this would appear to be
> >> the default.
> >
> > Thanks for the insight, that's a much better argument than 'something
> > may need it'. Are those all chips without an IOMMU or do we also
> > need to solve the IOMMU case with split bank allocation?
> 
> Yes. The IOMMU case with split bank allocation is key, especially for shared
> buffers. Consider the case where video is using a certain bank for performance
> purposes and that frame is shared with the GPU.

Could we use the non-uniform memory access (NUMA) code for this? That code
does more than what we've been talking about, and we're currently thinking
only of a degenerate case (one CPU node with multiple memory nodes), but my
feeling is that we can still build on top of it.

The NUMA code can describe relations between different areas of memory
and how they interact with devices and processes, so you can attach a
device to a specific node and have all allocations done from there.
You can also set policy in user space, e.g. to have a video decoder
process running on the bank that is not used by the GPU.

In the DMA mapping API, that would mean we add another dma_attr to
dma_alloc_* that lets you pass a node identifier.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Nazarewicz June 15, 2011, 11:30 a.m. UTC | #23
On Wed, 15 Jun 2011 13:20:42 +0200, Arnd Bergmann <arnd@arndb.de> wrote:
> The point is that on the higher level device drivers, we want to
> hide the presence of CMA and/or IOMMU behind the dma mapping API,
> but the device drivers do need to know about the bank properties.

Gotcha, thanks.
Daniel Vetter June 15, 2011, 11:53 a.m. UTC | #24
On Tue, Jun 14, 2011 at 20:30, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
>> Ah yes, I forgot that separate regions for different purposes could
>> decrease fragmentation.
>
> That is indeed a good point, but having a good allocator algorithm
> could also solve this. I don't know too much about these allocation
> algorithms, but there are probably multiple working approaches to this.

imo no allocator algorithm is gonna help if you have comparably large,
variable-sized contiguous allocations out of a restricted address range.
It might work well enough if there are only a few sizes and/or there's
decent headroom. But for really generic workloads this would require
sync objects and eviction callbacks (i.e. what Thomas Hellstrom pushed
with ttm).

So if this is only a requirement on very few platforms and can be
cheaply fixed with multiple cma allocation areas (heck, we have
slabs for the same reasons in the kernel), it might be a sensible
compromise.
-Daniel
Thomas Hellstrom June 15, 2011, 1:12 p.m. UTC | #25
On 06/15/2011 01:53 PM, Daniel Vetter wrote:
> On Tue, Jun 14, 2011 at 20:30, Arnd Bergmann<arnd@arndb.de>  wrote:
>    
>> On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
>>      
>>> Ah yes, I forgot that separate regions for different purposes could
>>> decrease fragmentation.
>>>        
>> That is indeed a good point, but having a good allocator algorithm
>> could also solve this. I don't know too much about these allocation
>> algorithms, but there are probably multiple working approaches to this.
>>      
> imo no allocator algorithm is gonna help if you have comparably large,
> variable-sized contiguous allocations out of a restricted address range.
> It might work well enough if there are only a few sizes and/or there's
> decent headroom. But for really generic workloads this would require
> sync objects and eviction callbacks (i.e. what Thomas Hellstrom pushed
> with ttm).
>    

Indeed, IIRC on the meeting I pointed out that there is no way to 
generically solve the fragmentation problem without movable buffers. 
(I'd do it as a simple CMA backend to TTM). This is exactly the same 
problem as trying to fit buffers in a limited VRAM area.

/Thomas


--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Larry Bassel June 15, 2011, 9:39 p.m. UTC | #26
On 15 Jun 11 10:36, Marek Szyprowski wrote:
> Hello,
> 
> On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:
> 
> > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> > > I've seen this split bank allocation in Qualcomm and TI SoCs, with
> > > Samsung, that makes 3 major SoC vendors (I would be surprised if
> > > Nvidia didn't also need to do this) - so I think some configurable
> > > method to control allocations is necessarily. The chips can't do
> > > decode without it (and by can't do I mean 1080P and higher decode is
> > > not functionally useful). Far from special, this would appear to be
> > > the default.

We at Qualcomm have some platforms that have memory of different
performance characteristics, some drivers will need a way of
specifying that they need fast memory for an allocation (and would prefer
an error if it is not available rather than a fallback to slower
memory). It would also be bad if allocators who don't need fast
memory got it "accidentally", depriving those who really need it.

> > 
> > Thanks for the insight, that's a much better argument than 'something
> > may need it'. Are those all chips without an IOMMU or do we also
> > need to solve the IOMMU case with split bank allocation?
> > 
> > I think I'd still prefer to see the support for multiple regions split
> > out into one of the later patches, especially since that would defer
> > the question of how to do the initialization for this case and make
> > sure we first get a generic way.
> > 
> > You've convinced me that we need to solve the problem of allocating
> > memory from a specific bank eventually, but separating it from the
> > one at hand (contiguous allocation) should help getting the important
> > groundwork in at first.
> >
> > The possible conflict that I still see with per-bank CMA regions are:
> > 
> > * It completely destroys memory power management in cases where that
> >   is based on powering down entire memory banks.
> 
> I don't think that per-bank CMA regions destroys memory power management
> more than the global CMA pool. Please note that the contiguous buffers
> (or in general dma-buffers) right now are unmovable so they don't fit
> well into memory power management.

We also have platforms where a well-defined part of the memory
can be powered off, and other parts can't (or won't). We need a way
to steer the place allocations come from to the memory that won't be
turned off (so that CMA allocations are not an obstacle to memory
hotremove).

> 
> Best regards
> -- 
> Marek Szyprowski
> Samsung Poland R&D Center
> 
> 
> 
> _______________________________________________
> Linaro-mm-sig mailing list
> Linaro-mm-sig@lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-mm-sig

Larry Bassel
Arnd Bergmann June 15, 2011, 10:06 p.m. UTC | #27
On Wednesday 15 June 2011 23:39:58 Larry Bassel wrote:
> On 15 Jun 11 10:36, Marek Szyprowski wrote:
> > On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:
> > 
> > > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> > > > I've seen this split bank allocation in Qualcomm and TI SoCs, with
> > > > Samsung, that makes 3 major SoC vendors (I would be surprised if
> > > > Nvidia didn't also need to do this) - so I think some configurable
> > > > method to control allocations is necessarily. The chips can't do
> > > > decode without it (and by can't do I mean 1080P and higher decode is
> > > > not functionally useful). Far from special, this would appear to be
> > > > the default.
> 
> We at Qualcomm have some platforms that have memory of different
> performance characteristics, some drivers will need a way of
> specifying that they need fast memory for an allocation (and would prefer
> an error if it is not available rather than a fallback to slower
> memory). It would also be bad if allocators who don't need fast
> memory got it "accidentally", depriving those who really need it.

Can you describe how the memory areas differ specifically?
Is there one that is always faster but very small, or are there
just specific circumstances under which some memory is faster than
another?

> > > The possible conflict that I still see with per-bank CMA regions are:
> > > 
> > > * It completely destroys memory power management in cases where that
> > >   is based on powering down entire memory banks.
> > 
> > I don't think that per-bank CMA regions destroys memory power management
> > more than the global CMA pool. Please note that the contiguous buffers
> > (or in general dma-buffers) right now are unmovable so they don't fit
> > well into memory power management.
> 
> We also have platforms where a well-defined part of the memory
> can be powered off, and other parts can't (or won't). We need a way
> to steer the place allocations come from to the memory that won't be
> turned off (so that CMA allocations are not an obstacle to memory
> hotremove).

We already established that we have to know something about the banks,
and your additional input makes it even clearer that we need to consider
the bigger picture here: We need to describe parts of memory separately
regarding general performance, device specific allocations and hotplug
characteristics.

It still sounds to me that this can be done using the NUMA properties
that Linux already understands, and teaching more subsystems about it,
but maybe the memory hotplug developers have already come up with
another scheme. The way that memory hotplug and CMA choose their
memory regions certainly needs to take both into account. As far as
I can see there are both conflicting and synergistic effects when
you combine the two.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Philip Balister June 16, 2011, 12:48 a.m. UTC | #28
On 06/15/2011 12:37 AM, Arnd Bergmann wrote:
> On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
>> I see your concerns, but I really wonder how to determine the properties
>> of the global/default cma pool. You definitely don't want to give all
>> available memory o CMA, because it will have negative impact on kernel
>> operation (kernel really needs to allocate unmovable pages from time to
>> time).
>
> Exactly. This is a hard problem, so I would prefer to see a solution for
> coming up with reasonable defaults.

Is this a situation where passing the information from device tree might 
help? I know this does not help short term, but I am trying to 
understand the sorts of problems device tree can help solve.

Philip

>
>> The only solution I see now is to provide Kconfig entry to determine
>> the size of the global CMA pool, but this still have some issues,
>> especially for multi-board kernels (each board probably will have
>> different amount of RAM and different memory-consuming devices
>> available). It looks that each board startup code still might need to
>> tweak the size of CMA pool. I can add a kernel command line option for
>> it, but such solution also will not solve all the cases (afair there
>> was a discussion about kernel command line parameters for memory
>> configuration and the conclusion was that it should be avoided).
>
> The command line option can be a last resort if the heuristics fail,
> but it's not much better than a fixed Kconfig setting.
>
> How about a Kconfig option that defines the percentage of memory
> to set aside for contiguous allocations?
>
> 	Arnd
>
> _______________________________________________
> Linaro-mm-sig mailing list
> Linaro-mm-sig@lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-mm-sig
>
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zach Pfeffer June 16, 2011, 3:20 a.m. UTC | #29
On 15 June 2011 16:39, Larry Bassel <lbassel@codeaurora.org> wrote:
> On 15 Jun 11 10:36, Marek Szyprowski wrote:
>> Hello,
>>
>> On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:
>>
>> > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
>> > > I've seen this split bank allocation in Qualcomm and TI SoCs, with
>> > > Samsung, that makes 3 major SoC vendors (I would be surprised if
>> > > Nvidia didn't also need to do this) - so I think some configurable
>> > > method to control allocations is necessarily. The chips can't do
>> > > decode without it (and by can't do I mean 1080P and higher decode is
>> > > not functionally useful). Far from special, this would appear to be
>> > > the default.
>
> We at Qualcomm have some platforms that have memory of different
> performance characteristics, some drivers will need a way of
> specifying that they need fast memory for an allocation (and would prefer
> an error if it is not available rather than a fallback to slower
> memory). It would also be bad if allocators who don't need fast
> memory got it "accidentally", depriving those who really need it.

I think this statement actually applies to all the SoCs that are
coming out now and in the future from TI, Samsung, Nvidia, Freescale,
ST Ericsson and others. It seems that in all cases users will want to:

1. Allocate memory with a per-SoC physical memory mapping policy that
is usually manually specified, i.e. use this physical memory bank set
for this allocation or nothing.
2. Be able to easily pass a token to this memory between various
userspace processes and the kernel.
3. Be able to easily and explicitly access attributes of an allocation
from all contexts.
4. Be able to save and reload this memory without giving up the
virtual address allocation.

In essence they want a architectural independent map object that can
bounce around the system with a unique handle.
>> >
>> > Thanks for the insight, that's a much better argument than 'something
>> > may need it'. Are those all chips without an IOMMU or do we also
>> > need to solve the IOMMU case with split bank allocation?
>> >
>> > I think I'd still prefer to see the support for multiple regions split
>> > out into one of the later patches, especially since that would defer
>> > the question of how to do the initialization for this case and make
>> > sure we first get a generic way.
>> >
>> > You've convinced me that we need to solve the problem of allocating
>> > memory from a specific bank eventually, but separating it from the
>> > one at hand (contiguous allocation) should help getting the important
>> > groundwork in at first.
>> >
>> > The possible conflict that I still see with per-bank CMA regions are:
>> >
>> > * It completely destroys memory power management in cases where that
>> >   is based on powering down entire memory banks.
>>
>> I don't think that per-bank CMA regions destroys memory power management
>> more than the global CMA pool. Please note that the contiguous buffers
>> (or in general dma-buffers) right now are unmovable so they don't fit
>> well into memory power management.
>
> We also have platforms where a well-defined part of the memory
> can be powered off, and other parts can't (or won't). We need a way
> to steer the place allocations come from to the memory that won't be
> turned off (so that CMA allocations are not an obstacle to memory
> hotremove).
>
>>
>> Best regards
>> --
>> Marek Szyprowski
>> Samsung Poland R&D Center
>>
>>
>>
>> _______________________________________________
>> Linaro-mm-sig mailing list
>> Linaro-mm-sig@lists.linaro.org
>> http://lists.linaro.org/mailman/listinfo/linaro-mm-sig
>
> Larry Bassel
>
> --
> Sent by an employee of the Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann June 16, 2011, 7:03 a.m. UTC | #30
On Thursday 16 June 2011 02:48:12 Philip Balister wrote:
> On 06/15/2011 12:37 AM, Arnd Bergmann wrote:
> > On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
> >> I see your concerns, but I really wonder how to determine the properties
> >> of the global/default cma pool. You definitely don't want to give all
> >> available memory o CMA, because it will have negative impact on kernel
> >> operation (kernel really needs to allocate unmovable pages from time to
> >> time).
> >
> > Exactly. This is a hard problem, so I would prefer to see a solution for
> > coming up with reasonable defaults.
> 
> Is this a situation where passing the information from device tree might 
> help? I know this does not help short term, but I am trying to 
> understand the sorts of problems device tree can help solve.

The device tree is a good place to describe any hardware properties such
as 'this device will need 32 MB of contiguous allocations on the memory
bank described in that other device node'.

It is however not a good place to describe user settings such as 'I want
to give this device a 200 MB pool for large allocations so I can run
application X efficiently', because that would require knowledge in the
boot loader about local policy, which it should generally not care about.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Larry Bassel June 16, 2011, 5:01 p.m. UTC | #31
On 16 Jun 11 00:06, Arnd Bergmann wrote:
> On Wednesday 15 June 2011 23:39:58 Larry Bassel wrote:
> > On 15 Jun 11 10:36, Marek Szyprowski wrote:
> > > On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:
> > > 
> > > > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> > > > > I've seen this split bank allocation in Qualcomm and TI SoCs, with
> > > > > Samsung, that makes 3 major SoC vendors (I would be surprised if
> > > > > Nvidia didn't also need to do this) - so I think some configurable
> > > > > method to control allocations is necessarily. The chips can't do
> > > > > decode without it (and by can't do I mean 1080P and higher decode is
> > > > > not functionally useful). Far from special, this would appear to be
> > > > > the default.
> > 
> > We at Qualcomm have some platforms that have memory of different
> > performance characteristics, some drivers will need a way of
> > specifying that they need fast memory for an allocation (and would prefer
> > an error if it is not available rather than a fallback to slower
> > memory). It would also be bad if allocators who don't need fast
> > memory got it "accidentally", depriving those who really need it.
> 
> Can you describe how the memory areas differ specifically?
> Is there one that is always faster but very small, or are there
> just specific circumstances under which some memory is faster than
> another?

One is always faster, but very small (generally 2-10% the size
of "normal" memory).

Larry
Arnd Bergmann June 17, 2011, 12:45 p.m. UTC | #32
On Thursday 16 June 2011 19:01:33 Larry Bassel wrote:
> > Can you describe how the memory areas differ specifically?
> > Is there one that is always faster but very small, or are there
> > just specific circumstances under which some memory is faster than
> > another?
> 
> One is always faster, but very small (generally 2-10% the size
> of "normal" memory).
> 

Ok, that sounds like the "SRAM" regions that we are handling on some
ARM platforms using the various interfaces. It should probably
remain outside of the regular allocator, but we can try to generalize
the SRAM support further. There are many possible uses for it.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann June 17, 2011, 4:08 p.m. UTC | #33
On Wednesday 15 June 2011, Daniel Vetter wrote:
> On Tue, Jun 14, 2011 at 20:30, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Tuesday 14 June 2011 18:58:35 Michal Nazarewicz wrote:
> >> Ah yes, I forgot that separate regions for different purposes could
> >> decrease fragmentation.
> >
> > That is indeed a good point, but having a good allocator algorithm
> > could also solve this. I don't know too much about these allocation
> > algorithms, but there are probably multiple working approaches to this.
> 
> imo no allocator algorithm is gonna help if you have comparably large,
> variable-sized contiguous allocations out of a restricted address range.
> It might work well enough if there are only a few sizes and/or there's
> decent headroom. But for really generic workloads this would require
> sync objects and eviction callbacks (i.e. what Thomas Hellstrom pushed
> with ttm).

The requirements are quite different depending on what system you
look at. In a lot of cases, the constraints are not that tight at all,
and CMA will easily help to turn "works sometimes" into "works almost
always". Let's get there first and then look into the harder problems.

Unfortunately, memory allocation gets nondeterministic in the corner
cases, you can simply get the system into a state where you don't
have enough memory when you try to do too many things at once. This
may sound like a platitude but it's really what is behind all this:

If we had unlimited amounts of RAM, we would never need CMA, we could
simply set aside a lot of memory at boot time. Having one CMA area
with movable page eviction lets you build systems capable of doing
the same thing with less RAM than without CMA. Adding more complexity
lets you reduce that amount further.

The other aspects that have been mentioned about bank affinity and
SRAM are pretty orthogonal to the allocation, so we should also
treat them separately.

> So if this is only a requirement on very few platforms and can be
> cheaply fixed with multiple cma allocation areas (heck, we have
> slabs for the same reasons in the kernel), it might be a sensible
> compromise.

Yes, we can probably add it later when we find out what the limits
of the generic approach are. I don't really mind having the per-device
pointers to CMA areas, we just need to come up with a good way to
initialize them.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hans Verkuil June 22, 2011, 7:03 a.m. UTC | #34
On Wednesday, June 15, 2011 09:37:18 Arnd Bergmann wrote:
> On Wednesday 15 June 2011 09:11:39 Marek Szyprowski wrote:
> > I see your concerns, but I really wonder how to determine the properties
> > of the global/default cma pool. You definitely don't want to give all
> > available memory o CMA, because it will have negative impact on kernel
> > operation (kernel really needs to allocate unmovable pages from time to
> > time). 
> 
> Exactly. This is a hard problem, so I would prefer to see a solution for
> coming up with reasonable defaults.
> 
> > The only solution I see now is to provide Kconfig entry to determine
> > the size of the global CMA pool, but this still have some issues,
> > especially for multi-board kernels (each board probably will have
> > different amount of RAM and different memory-consuming devices
> > available). It looks that each board startup code still might need to
> > tweak the size of CMA pool. I can add a kernel command line option for
> > it, but such solution also will not solve all the cases (afair there
> > was a discussion about kernel command line parameters for memory 
> > configuration and the conclusion was that it should be avoided).
> 
> The command line option can be a last resort if the heuristics fail,
> but it's not much better than a fixed Kconfig setting.
> 
> How about a Kconfig option that defines the percentage of memory
> to set aside for contiguous allocations?

I would actually like to see a cma_size kernel option of some sort. This would
be for the global CMA pool only as I don't think we should try to do anything
more complicated here.

While it is relatively easy for embedded systems to do a recompile every time
you need to change the pool size, this isn't an option on 'normal' desktop
systems.

While usually you have more than enough memory on such systems and don't need
CMA, there are a number of cases where you do want to reserve sufficient
memory. Usually these involve lots of video capture cards in one system.

What I was wondering about is how this patch series changes the allocation
in case it can't allocate from the CMA pool. Will it attempt to fall back
to a 'normal' allocation?

The reason I ask is that for desktop systems you could just start with a CMA
pool of size 0. And only in specific situations would you need to add a
cma_size kernel parameter depending on your needs. But this scheme would
require a fallback scenario in case of a global CMA pool of size 0.

Hmm, perhaps this fallback scenario is more driver specific. For SoC platform
video devices you may not want a fallback, whereas for PCI(e)/USB devices you
do. I don't know what's best, frankly.

Regards,

	Hans
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Nazarewicz June 22, 2011, 7:32 a.m. UTC | #35
On Wed, 22 Jun 2011 09:03:30 +0200, Hans Verkuil <hverkuil@xs4all.nl>  
wrote:
> What I was wondering about is how this patch series changes the  
> allocation in case it can't allocate from the CMA pool. Will it
> attempt to fall back to a 'normal' allocation?

Unless Marek changed something since I wrote the code, which I doubt,
if CMA cannot obtain memory from CMA region, it will fail.

Part of the reason is that CMA lacks the knowledge where to allocate
memory from.  For instance, with the case of several memory banks,
it does not know which memory bank to allocate from.

It is, in my opinion, a task for a higher level functions (read:
DMA layer) to try another mechanism if CMA fails.
Arnd Bergmann June 22, 2011, 12:42 p.m. UTC | #36
On Wednesday 22 June 2011, Hans Verkuil wrote:
> > How about a Kconfig option that defines the percentage of memory
> > to set aside for contiguous allocations?
> 
> I would actually like to see a cma_size kernel option of some sort. This would
> be for the global CMA pool only as I don't think we should try to do anything
> more complicated here.

A command line is probably good to override the compile-time default, yes.

We could also go further and add a runtime sysctl mechanism like the one
for hugepages, where you can grow the pool at run time as long as there is
enough free contiguous memory (e.g. from init scripts), or shrink it later
if you want to allow larger nonmovable allocations.

My feeling is that we need to find a way to integrate the global settings
for four kinds of allocations:

* nonmovable kernel pages
* hugetlb pages
* CMA
* memory hotplug

These essentially fight over the same memory (though things are slightly
different with dynamic hugepages), and they all face the same basic problem
of getting as much for themselves without starving the other three.

	Arnd

--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Marek Szyprowski June 22, 2011, 1:15 p.m. UTC | #37
Hello,

On Wednesday, June 22, 2011 2:42 PM Arnd Bergmann wrote:

> On Wednesday 22 June 2011, Hans Verkuil wrote:
> > > How about a Kconfig option that defines the percentage of memory
> > > to set aside for contiguous allocations?
> >
> > I would actually like to see a cma_size kernel option of some sort. This
> would
> > be for the global CMA pool only as I don't think we should try to do
> anything
> > more complicated here.
> 
> A command line is probably good to override the compile-time default, yes.
> 
> We could also go further and add a runtime sysctl mechanism like the one
> for hugepages, where you can grow the pool at run time as long as there is
> enough free contiguous memory (e.g. from init scripts), or shrink it later
> if you want to allow larger nonmovable allocations.

Sounds really good, but it might be really hard to implemnt, at least for
CMA, because it needs to tweak parameters of memory management internal 
structures very early, when buddy allocator has not been activated yet.

> My feeling is that we need to find a way to integrate the global settings
> for four kinds of allocations:
> 
> * nonmovable kernel pages
> * hugetlb pages
> * CMA
> * memory hotplug
> 
> These essentially fight over the same memory (though things are slightly
> different with dynamic hugepages), and they all face the same basic problem
> of getting as much for themselves without starving the other three.

I'm not sure we can solve all such issues in the first version. Maybe we should
first have each of the above fully working in mainline separately and then
start the integration works.

Best regards
Arnd Bergmann June 22, 2011, 1:39 p.m. UTC | #38
On Wednesday 22 June 2011, Marek Szyprowski wrote:
> Sounds really good, but it might be really hard to implemnt, at least for
> CMA, because it needs to tweak parameters of memory management internal 
> structures very early, when buddy allocator has not been activated yet.

Why that? I would expect you can do the same that hugepages (used to) do
and just attempt high-order allocations. If they succeed, you can add them
as a CMA region and free them again, into the movable set of pages, otherwise
you just fail the  request from user space when the memory is already
fragmented.
 
> > These essentially fight over the same memory (though things are slightly
> > different with dynamic hugepages), and they all face the same basic problem
> > of getting as much for themselves without starving the other three.
> 
> I'm not sure we can solve all such issues in the first version. Maybe we should
> first have each of the above fully working in mainline separately and then
> start the integration works.

Yes, makes sense. We just need to be careful not to introduce user-visible
interfaces that we cannot change any more in the process.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michał Nazarewicz June 22, 2011, 3:54 p.m. UTC | #39
> On Wednesday, June 22, 2011 2:42 PM Arnd Bergmann wrote:
>> We could also go further and add a runtime sysctl mechanism like the
>> one for hugepages, where you can grow the pool at run time as long
>> as there is enough free contiguous memory (e.g. from init scripts),
>> or shrink it later if you want to allow larger nonmovable allocations.

On Wed, 22 Jun 2011 15:15:35 +0200, Marek Szyprowski wrote:
> Sounds really good, but it might be really hard to implement, at
> least for CMA, because it needs to tweak parameters of memory
> management internal structures very early, when buddy allocator
> has not been activated yet.

If you are able to allocate a pageblock of free memory from buddy system,
you should be able to convert it to CMA memory with no problems.

Also, if you want to convert CMA memory back to regular memory you
should be able to do that even if some of the memory is used by CMA
(it just won't be available right away but only when CMA frees it).

It is important to note that, because of the use of migration type,
all such conversion have to be performed on pageblock basis.

I don't think this is a feature we should consider for the first patch
though.  We started with an overgrown idea about what CMA might do
and it didn't got us far.  Let's first get the basics right and
then start implementing features as they become needed.
Michał Nazarewicz June 22, 2011, 4:04 p.m. UTC | #40
On Wed, 22 Jun 2011 15:39:23 +0200, Arnd Bergmann <arnd@arndb.de> wrote:
> Why that? I would expect you can do the same that hugepages (used to) do
> and just attempt high-order allocations. If they succeed, you can add  
> them as a CMA region and free them again, into the movable set of pages,  
> otherwise you just fail the  request from user space when the memory is
> already fragmented.

Problem with that is that CMA needs to have whole pageblocks allocated
and buddy can allocate at most half a pageblock.
Ankita Garg July 4, 2011, 5:25 a.m. UTC | #41
Hi,

On Thu, Jun 16, 2011 at 12:06:07AM +0200, Arnd Bergmann wrote:
> On Wednesday 15 June 2011 23:39:58 Larry Bassel wrote:
> > On 15 Jun 11 10:36, Marek Szyprowski wrote:
> > > On Tuesday, June 14, 2011 10:42 PM Arnd Bergmann wrote:
> > > 
> > > > On Tuesday 14 June 2011 20:58:25 Zach Pfeffer wrote:
> > > > > I've seen this split bank allocation in Qualcomm and TI SoCs, with
> > > > > Samsung, that makes 3 major SoC vendors (I would be surprised if
> > > > > Nvidia didn't also need to do this) - so I think some configurable
> > > > > method to control allocations is necessarily. The chips can't do
> > > > > decode without it (and by can't do I mean 1080P and higher decode is
> > > > > not functionally useful). Far from special, this would appear to be
> > > > > the default.
> > 
> > We at Qualcomm have some platforms that have memory of different
> > performance characteristics, some drivers will need a way of
> > specifying that they need fast memory for an allocation (and would prefer
> > an error if it is not available rather than a fallback to slower
> > memory). It would also be bad if allocators who don't need fast
> > memory got it "accidentally", depriving those who really need it.
> 
> Can you describe how the memory areas differ specifically?
> Is there one that is always faster but very small, or are there
> just specific circumstances under which some memory is faster than
> another?
> 
> > > > The possible conflict that I still see with per-bank CMA regions are:
> > > > 
> > > > * It completely destroys memory power management in cases where that
> > > >   is based on powering down entire memory banks.
> > > 
> We already established that we have to know something about the banks,
> and your additional input makes it even clearer that we need to consider
> the bigger picture here: We need to describe parts of memory separately
> regarding general performance, device specific allocations and hotplug
> characteristics.
> 
> It still sounds to me that this can be done using the NUMA properties
> that Linux already understands, and teaching more subsystems about it,
> but maybe the memory hotplug developers have already come up with
> another scheme. The way that memory hotplug and CMA choose their
> memory regions certainly needs to take both into account. As far as
> I can see there are both conflicting and synergistic effects when
> you combine the two.
> 

Recently, we proposed a generic 'memory regions' framework to exploit
the memory power management capabilities on the embedded boards. Think
of some of the above CMA requirements could be met by this fraemwork.
One of the main goals of regions is to make the VM aware of the hardware
memory boundaries, like bank. For managing memory power consumption,
memory regions are created aligned to the hardware granularity at which
the power can be managed (ie, the memory power consumption operations
like on/off can be performed). If attributed are associated with each of
these regions, some of these regions could be marked as CMA-only,
ensuring that only movable and per-bank memory is allocated. More
details on the design can be found here:

http://lkml.org/lkml/2011/5/27/177
http://lkml.org/lkml/2011/6/29/202
http://lwn.net/Articles/446493/
Arnd Bergmann July 4, 2011, 2:45 p.m. UTC | #42
On Monday 04 July 2011, Ankita Garg wrote:
> > It still sounds to me that this can be done using the NUMA properties
> > that Linux already understands, and teaching more subsystems about it,
> > but maybe the memory hotplug developers have already come up with
> > another scheme. The way that memory hotplug and CMA choose their
> > memory regions certainly needs to take both into account. As far as
> > I can see there are both conflicting and synergistic effects when
> > you combine the two.
> > 
> 
> Recently, we proposed a generic 'memory regions' framework to exploit
> the memory power management capabilities on the embedded boards. Think
> of some of the above CMA requirements could be met by this fraemwork.
> One of the main goals of regions is to make the VM aware of the hardware
> memory boundaries, like bank. For managing memory power consumption,
> memory regions are created aligned to the hardware granularity at which
> the power can be managed (ie, the memory power consumption operations
> like on/off can be performed). If attributed are associated with each of
> these regions, some of these regions could be marked as CMA-only,
> ensuring that only movable and per-bank memory is allocated. More
> details on the design can be found here:
> 
> http://lkml.org/lkml/2011/5/27/177
> http://lkml.org/lkml/2011/6/29/202
> http://lwn.net/Articles/446493/

Thanks for the pointers, that is exactly what I was looking for.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-media" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/linux/cma.h b/include/linux/cma.h
new file mode 100644
index 0000000..70a993c
--- /dev/null
+++ b/include/linux/cma.h
@@ -0,0 +1,189 @@ 
+#ifndef __LINUX_CMA_H
+#define __LINUX_CMA_H
+
+/*
+ * Contiguous Memory Allocator
+ * Copyright (c) 2010-2011 by Samsung Electronics.
+ * Written by:
+ *	Michal Nazarewicz <mina86@mina86.com>
+ *	Marek Szyprowski <m.szyprowski@samsung.com>
+ */
+
+/*
+ * Contiguous Memory Allocator
+ *
+ *   The Contiguous Memory Allocator (CMA) makes it possible for
+ *   device drivers to allocate big contiguous chunks of memory after
+ *   the system has booted.
+ *
+ *   It requires some machine- and/or platform-specific initialisation
+ *   code which prepares memory ranges to be used with CMA and later,
+ *   device drivers can allocate memory from those ranges.
+ *
+ * Why is it needed?
+ *
+ *   Various devices on embedded systems have no scatter-getter and/or
+ *   IO map support and require contiguous blocks of memory to
+ *   operate.  They include devices such as cameras, hardware video
+ *   coders, etc.
+ *
+ *   Such devices often require big memory buffers (a full HD frame
+ *   is, for instance, more then 2 mega pixels large, i.e. more than 6
+ *   MB of memory), which makes mechanisms such as kmalloc() or
+ *   alloc_page() ineffective.
+ *
+ *   At the same time, a solution where a big memory region is
+ *   reserved for a device is suboptimal since often more memory is
+ *   reserved then strictly required and, moreover, the memory is
+ *   inaccessible to page system even if device drivers don't use it.
+ *
+ *   CMA tries to solve this issue by operating on memory regions
+ *   where only movable pages can be allocated from.  This way, kernel
+ *   can use the memory for pagecache and when device driver requests
+ *   it, allocated pages can be migrated.
+ *
+ * Driver usage
+ *
+ *   CMA should not be used directly by the device drivers. It should
+ *   be considered as helper framework for dma-mapping subsystm and
+ *   respective (platform)bus drivers.
+ *
+ *   The CMA client needs to have a pointer to a CMA context
+ *   represented by a struct cma (which is an opaque data type).
+ *
+ *   Once such pointer is obtained, a caller may allocate contiguous
+ *   memory chunk using the following function:
+ *
+ *     cm_alloc()
+ *
+ *   This function returns a pointer to the first struct page which
+ *   represent a contiguous memory chunk.  This pointer
+ *   may be used with the following function:
+ *
+ *     cm_free()    -- frees allocated contiguous memory
+ *
+ * Platform/machine integration
+ *
+ *   CMA context must be created on platform or machine initialisation
+ *   and passed to respective subsystem that will be a client for CMA.
+ *   The latter may be done by a global variable or some filed in
+ *   struct device.  For the former CMA provides the following functions:
+ *
+ *     cma_init_migratetype()
+ *     cma_reserve()
+ *     cma_create()
+ *
+ *   The first one initialises a portion of reserved memory so that it
+ *   can be used with CMA.  The second first tries to reserve memory
+ *   (using memblock) and then initialise it.
+ *
+ *   The cma_reserve() function must be called when memblock is still
+ *   operational and reserving memory with it is still possible.  On
+ *   ARM platform the "reserve" machine callback is a perfect place to
+ *   call it.
+ *
+ *   The last function creates a CMA context on a range of previously
+ *   initialised memory addresses.  Because it uses kmalloc() it needs
+ *   to be called after SLAB is initialised.
+ */
+
+/***************************** Kernel level API *****************************/
+
+#ifdef __KERNEL__
+
+struct cma;
+struct page;
+
+#ifdef CONFIG_CMA
+
+/**
+ * cma_init_migratetype() - initialises range of physical memory to be used
+ *		with CMA context.
+ * @start:	start address of the memory range in bytes.
+ * @size:	size of the memory range in bytes.
+ *
+ * The range must be MAX_ORDER_NR_PAGES aligned and it must have been
+ * already reserved (eg. with memblock).
+ *
+ * The actual initialisation is deferred until subsys initcalls are
+ * evaluated (unless this has already happened).
+ *
+ * Returns zero on success or negative error.
+ */
+int cma_init_migratetype(unsigned long start, unsigned long end);
+
+/**
+ * cma_reserve() - reserves memory.
+ * @start:	start address of the memory range in bytes hint; if unsure
+ *		pass zero.
+ * @size:	size of the memory to reserve in bytes.
+ *
+ * It will use memblock to allocate memory. It will also call
+ * cma_init_migratetype() on reserved region so that a CMA context can
+ * be created on given range.
+ *
+ * @start and @size will be aligned to (MAX_ORDER_NR_PAGES << PAGE_SHIFT).
+ *
+ * Returns reserved's area physical address or value that yields true
+ * when checked with IS_ERR_VALUE().
+ */
+unsigned long cma_reserve(unsigned long start, unsigned long size);
+
+/**
+ * cma_create() - creates a CMA context.
+ * @start:	start address of the context in bytes.
+ * @size:	size of the context in bytes.
+ *
+ * The range must be page aligned.  Different contexts cannot overlap.
+ *
+ * The memory range must either lay in ZONE_MOVABLE or must have been
+ * initialised with cma_init_migratetype() function.
+ *
+ * @start and @size must be page aligned.
+ *
+ * Because this function uses kmalloc() it must be called after SLAB
+ * is initialised.  This in particular means that it cannot be called
+ * just after cma_reserve() since the former needs to be run way
+ * earlier.
+ *
+ * Returns pointer to CMA context or a pointer-error on error.
+ */
+struct cma *cma_create(unsigned long start, unsigned long size);
+
+/**
+ * cma_destroy() - destroys CMA context.
+ * @cma:	context to destroy.
+ */
+void cma_destroy(struct cma *cma);
+
+/**
+ * cm_alloc() - allocates contiguous memory.
+ * @cma:	CMA context to use
+ * @count:	desired chunk size in pages (must be non-zero)
+ * @order:	desired alignment in pages
+ *
+ * Returns pointer to first page structure representing contiguous memory
+ * or a pointer-error on error.
+ */
+struct page *cm_alloc(struct cma *cma, int count, unsigned int order);
+
+/**
+ * cm_free() - frees contiguous memory.
+ * @cma:	CMA context to use
+ * @pages:	contiguous memory to free
+ * @count:	desired chunk size in pages (must be non-zero)
+ *
+ */
+void cm_free(struct cma *cma, struct page *pages, int count);
+
+#else
+struct page *cm_alloc(struct cma *cma, int count, unsigned int order)
+{
+	return NULL;
+};
+void cm_free(struct cma *cma, struct page *pages, int count) { }
+#endif
+
+#endif
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 6ffedd8..b85de4c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -346,6 +346,27 @@  choice
 	  benefit.
 endchoice
 
+config CMA
+	bool "Contiguous Memory Allocator"
+	select CMA_MIGRATE_TYPE
+	select MIGRATION
+	select GENERIC_ALLOCATOR
+	help
+	  This enables the Contiguous Memory Allocator which allows drivers
+	  to allocate big physically-contiguous blocks of memory for use with
+	  hardware components that do not support I/O map nor scatter-gather.
+
+	  For more information see <include/linux/cma.h>.  If unsure, say "n".
+
+config CMA_DEBUG
+	bool "CMA debug messages (DEVELOPEMENT)"
+	depends on CMA
+	help
+	  Turns on debug messages in CMA.  This produces KERN_DEBUG
+	  messages for every CMA call as well as various messages while
+	  processing calls such as cm_alloc().  This option does not
+	  affect warning and error messages.
+
 #
 # UP and nommu archs use km based percpu allocator
 #
diff --git a/mm/Makefile b/mm/Makefile
index 836e416..4610320 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,3 +50,4 @@  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
+obj-$(CONFIG_CMA) += cma.o
diff --git a/mm/cma.c b/mm/cma.c
new file mode 100644
index 0000000..aee306c
--- /dev/null
+++ b/mm/cma.c
@@ -0,0 +1,358 @@ 
+/*
+ * Contiguous Memory Allocator
+ * Copyright (c) 2010-2011 by Samsung Electronics.
+ * Written by:
+ *	Michal Nazarewicz <mina86@mina86.com>
+ *	Marek Szyprowski <m.szyprowski@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of the
+ * License or (at your optional) any later version of the license.
+ */
+
+/*
+ * See include/linux/cma.h for details.
+ */
+
+#define pr_fmt(fmt) "cma: " fmt
+
+#ifdef CONFIG_CMA_DEBUG
+#  define DEBUG
+#endif
+
+#include <asm/page.h>
+#include <asm/errno.h>
+
+#include <linux/cma.h>
+#include <linux/memblock.h>
+#include <linux/err.h>
+#include <linux/genalloc.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/page-isolation.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/mm_types.h>
+
+#include "internal.h"
+
+/* XXX Revisit */
+#ifdef phys_to_pfn
+/* nothing to do */
+#elif defined __phys_to_pfn
+#  define phys_to_pfn __phys_to_pfn
+#else
+#  warning correct phys_to_pfn implementation needed
+static unsigned long phys_to_pfn(phys_addr_t phys)
+{
+	return virt_to_pfn(phys_to_virt(phys));
+}
+#endif
+
+
+/************************* Initialise CMA *************************/
+
+static struct cma_grabbed {
+	unsigned long start;
+	unsigned long size;
+} cma_grabbed[8] __initdata;
+static unsigned cma_grabbed_count __initdata;
+
+#ifdef CONFIG_DEBUG_VM
+
+static int __cma_give_back(unsigned long start, unsigned long size)
+{
+	unsigned long pfn = phys_to_pfn(start);
+	unsigned i = size >> PAGE_SHIFT;
+	struct zone *zone;
+
+	pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+	VM_BUG_ON(!pfn_valid(pfn));
+	zone = page_zone(pfn_to_page(pfn));
+
+	do {
+		VM_BUG_ON(!pfn_valid(pfn));
+		VM_BUG_ON(page_zone(pfn_to_page(pfn)) != zone);
+		if (!(pfn & (pageblock_nr_pages - 1)))
+			__free_pageblock_cma(pfn_to_page(pfn));
+		++pfn;
+		++totalram_pages;
+	} while (--i);
+
+	return 0;
+}
+
+#else
+
+static int __cma_give_back(unsigned long start, unsigned long size)
+{
+	unsigned i = size >> (PAGE_SHIFT + pageblock_order);
+	struct page *p = phys_to_page(start);
+
+	pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+	do {
+		__free_pageblock_cma(p);
+		p += pageblock_nr_pages;
+		totalram_pages += pageblock_nr_pages;
+	} while (--i);
+
+	return 0;
+}
+
+#endif
+
+static int __init __cma_queue_give_back(unsigned long start, unsigned long size)
+{
+	if (cma_grabbed_count == ARRAY_SIZE(cma_grabbed))
+		return -ENOSPC;
+
+	cma_grabbed[cma_grabbed_count].start = start;
+	cma_grabbed[cma_grabbed_count].size  = size;
+	++cma_grabbed_count;
+	return 0;
+}
+
+static int (*cma_give_back)(unsigned long start, unsigned long size) =
+	__cma_queue_give_back;
+
+static int __init cma_give_back_queued(void)
+{
+	struct cma_grabbed *r = cma_grabbed;
+	unsigned i = cma_grabbed_count;
+
+	pr_debug("%s(): will give %u range(s)\n", __func__, i);
+
+	cma_give_back = __cma_give_back;
+
+	for (; i; --i, ++r)
+		__cma_give_back(r->start, r->size);
+
+	return 0;
+}
+core_initcall(cma_give_back_queued);
+
+int __ref cma_init_migratetype(unsigned long start, unsigned long size)
+{
+	pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+	if (!size)
+		return -EINVAL;
+	if ((start | size) & ((MAX_ORDER_NR_PAGES << PAGE_SHIFT) - 1))
+		return -EINVAL;
+	if (start + size < start)
+		return -EOVERFLOW;
+
+	return cma_give_back(start, size);
+}
+
+unsigned long cma_reserve(unsigned long start, unsigned long size)
+{
+	unsigned long alignment;
+	int ret;
+
+	pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+	/* Sanity checks */
+	if (!size)
+		return (unsigned long)-EINVAL;
+
+	/* Sanitise input arguments */
+	start = ALIGN(start, MAX_ORDER_NR_PAGES << PAGE_SHIFT);
+	size  = ALIGN(size , MAX_ORDER_NR_PAGES << PAGE_SHIFT);
+	alignment = MAX_ORDER_NR_PAGES << PAGE_SHIFT;
+
+	/* Reserve memory */
+	if (start) {
+		if (memblock_is_region_reserved(start, size) ||
+		    memblock_reserve(start, size) < 0)
+			return (unsigned long)-EBUSY;
+	} else {
+		/*
+		 * Use __memblock_alloc_base() since
+		 * memblock_alloc_base() panic()s.
+		 */
+		u64 addr = __memblock_alloc_base(size, alignment, 0);
+		if (!addr) {
+			return (unsigned long)-ENOMEM;
+		} else if (addr + size > ~(unsigned long)0) {
+			memblock_free(addr, size);
+			return (unsigned long)-EOVERFLOW;
+		} else {
+			start = addr;
+		}
+	}
+
+	/* CMA Initialise */
+	ret = cma_init_migratetype(start, size);
+	if (ret < 0) {
+		memblock_free(start, size);
+		return ret;
+	}
+
+	return start;
+}
+
+
+/************************** CMA context ***************************/
+
+struct cma {
+	int migratetype;
+	struct gen_pool *pool;
+};
+
+static int __cma_check_range(unsigned long start, unsigned long size)
+{
+	int migratetype = MIGRATE_MOVABLE;
+	unsigned long pfn, count;
+	struct page *page;
+	struct zone *zone;
+
+	start = phys_to_pfn(start);
+	if (WARN_ON(!pfn_valid(start)))
+		return -EINVAL;
+
+	if (page_zonenum(pfn_to_page(start)) != ZONE_MOVABLE)
+		migratetype = MIGRATE_CMA;
+
+	/* First check if all pages are valid and in the same zone */
+	zone  = page_zone(pfn_to_page(start));
+	count = size >> PAGE_SHIFT;
+	pfn   = start;
+	while (++pfn, --count) {
+		if (WARN_ON(!pfn_valid(pfn)) ||
+		    WARN_ON(page_zone(pfn_to_page(pfn)) != zone))
+			return -EINVAL;
+	}
+
+	/* Now check migratetype of their pageblocks. */
+	start = start & ~(pageblock_nr_pages - 1);
+	pfn   = ALIGN(pfn, pageblock_nr_pages);
+	page  = pfn_to_page(start);
+	count = (pfn - start) >> PAGE_SHIFT;
+	do {
+		if (WARN_ON(get_pageblock_migratetype(page) != migratetype))
+			return -EINVAL;
+		page += pageblock_nr_pages;
+	} while (--count);
+
+	return migratetype;
+}
+
+struct cma *cma_create(unsigned long start, unsigned long size)
+{
+	struct gen_pool *pool;
+	int migratetype, ret;
+	struct cma *cma;
+
+	pr_debug("%s(%p+%p)\n", __func__, (void *)start, (void *)size);
+
+	if (!size)
+		return ERR_PTR(-EINVAL);
+	if ((start | size) & (PAGE_SIZE - 1))
+		return ERR_PTR(-EINVAL);
+	if (start + size < start)
+		return ERR_PTR(-EOVERFLOW);
+
+	migratetype = __cma_check_range(start, size);
+	if (migratetype < 0)
+		return ERR_PTR(migratetype);
+
+	cma = kmalloc(sizeof *cma, GFP_KERNEL);
+	if (!cma)
+		return ERR_PTR(-ENOMEM);
+
+	pool = gen_pool_create(ffs(PAGE_SIZE) - 1, -1);
+	if (!pool) {
+		ret = -ENOMEM;
+		goto error1;
+	}
+
+	ret = gen_pool_add(pool, start, size, -1);
+	if (unlikely(ret))
+		goto error2;
+
+	cma->migratetype = migratetype;
+	cma->pool = pool;
+
+	pr_debug("%s: returning <%p>\n", __func__, (void *)cma);
+	return cma;
+
+error2:
+	gen_pool_destroy(pool);
+error1:
+	kfree(cma);
+	return ERR_PTR(ret);
+}
+
+void cma_destroy(struct cma *cma)
+{
+	pr_debug("%s(<%p>)\n", __func__, (void *)cma);
+	gen_pool_destroy(cma->pool);
+}
+
+
+/************************* Allocate and free *************************/
+
+/* Protects cm_alloc(), cm_free() as well as gen_pools of each cm. */
+static DEFINE_MUTEX(cma_mutex);
+
+struct page *cm_alloc(struct cma *cma, int count, unsigned int order)
+{
+	unsigned long start;
+	unsigned long size = count << PAGE_SHIFT;
+
+	if (!cma)
+		return NULL;
+
+	pr_debug("%s(<%p>, %lx/%d)\n", __func__, (void *)cma, size, order);
+
+	if (!size)
+		return NULL;
+
+	mutex_lock(&cma_mutex);
+
+	start = gen_pool_alloc_aligned(cma->pool, size, order+12);
+	if (!start)
+		goto error1;
+
+	if (cma->migratetype) {
+		unsigned long pfn = phys_to_pfn(start);
+		int ret = alloc_contig_range(pfn, pfn + (size >> PAGE_SHIFT),
+					     0, cma->migratetype);
+		if (ret)
+			goto error2;
+	}
+
+	mutex_unlock(&cma_mutex);
+
+	pr_debug("%s(): returning [%p]\n", __func__, (void *)phys_to_page(start));
+	return phys_to_page(start);
+error2:
+	gen_pool_free(cma->pool, start, size);
+error1:
+	mutex_unlock(&cma_mutex);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(cm_alloc);
+
+void cm_free(struct cma *cma, struct page *pages, int count)
+{
+	unsigned long size = count << PAGE_SHIFT;
+	pr_debug("%s([%p])\n", __func__, (void *)pages);
+
+	if (!cma || !pages)
+		return;
+
+	mutex_lock(&cma_mutex);
+
+	gen_pool_free(cma->pool, page_to_phys(pages), size);
+	if (cma->migratetype)
+		free_contig_pages(pages, count);
+
+	mutex_unlock(&cma_mutex);
+}
+EXPORT_SYMBOL_GPL(cm_free);