[RFC,1/8] mm: Introduce memory regions data-structure to capture region boundaries within node

Message ID 20121106195225.6941.2868.stgit@srivatsabhat.in.ibm.com (mailing list archive)
State RFC, archived

Commit Message

Srivatsa S. Bhat Nov. 6, 2012, 7:52 p.m. UTC
Within a node, we can have regions of memory that can be power-managed.
That is, chunks of memory can be transitioned (manually or automatically)
to low-power states based on the frequency of references to that region.
For example, if a memory chunk is not referenced for a given threshold
amount of time, the hardware can decide to put that piece of memory into
a content-preserving low-power state. And of course, on the next reference
to that chunk of memory, it will be transitioned to full-power for
read/write operations.

We propose to incorporate this knowledge of power-manageable chunks of
memory into a new data-structure called "Memory Regions". This way of
acknowledging the existence of different classes of memory with different
characteristics is the first step towards managing memory
power-efficiently, for example by performing power-aware memory allocation.

[Also, the concept of memory regions could potentially be extended to work
with different classes of memory such as PCM (Phase Change Memory), so it
is not limited to power management alone.]

We already sub-divide a node's memory into zones, based on some well-known
constraints. So the question is: where do memory regions fit in this
hierarchy? Instead of artificially trying to fit them into the hierarchy one
way or the other, we choose to simply capture the region boundaries in a
parallel data-structure, since there is no guarantee that the region
boundaries will naturally fit inside zone boundaries or vice-versa.

But of course, memory regions are sub-divisions *within* a node, so it makes
sense to keep the data-structures in the node's struct pglist_data. (Thus
this placement makes memory regions parallel to zones in that node).

Once we capture the region boundaries in the memory regions data-structure,
we can influence MM decisions at various places, such as page allocation,
reclamation etc.
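
[Illustration only, not part of this patch: a minimal user-space sketch of
how the captured boundaries might be consulted. The struct mirrors the
fields added to mmzone.h below, minus the pgdat back-pointer. The wrapper
struct node_regions_table and the helper region_for_pfn() are hypothetical
names introduced here, and the sketch assumes that regions within a node
do not overlap.]

```c
#include <stddef.h>

#define MAX_NR_REGIONS	256

/* User-space mirror of the struct added by this patch (pgdat omitted). */
struct node_mem_region {
	unsigned long start_pfn;
	unsigned long present_pages;
	unsigned long spanned_pages;
	int idx;
	int node;
};

/* Hypothetical stand-in for the two fields added to struct pglist_data. */
struct node_regions_table {
	struct node_mem_region node_regions[MAX_NR_REGIONS];
	int nr_node_regions;
};

/*
 * Hypothetical helper: return the region whose spanned range contains
 * pfn, or NULL if no region in this node covers it. A linear scan is
 * enough for a sketch; at most MAX_NR_REGIONS entries are examined.
 */
static const struct node_mem_region *
region_for_pfn(const struct node_regions_table *tbl, unsigned long pfn)
{
	for (int i = 0; i < tbl->nr_node_regions; i++) {
		const struct node_mem_region *r = &tbl->node_regions[i];

		/* Subtract first so the range check cannot overflow. */
		if (pfn >= r->start_pfn &&
		    pfn - r->start_pfn < r->spanned_pages)
			return r;
	}
	return NULL;
}
```

A caller in the page allocator or reclaim path could use such a lookup to
decide which region a page belongs to before choosing where to allocate
from; how that decision is made is outside the scope of this sketch.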

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/mmzone.h |   13 +++++++++++++
 1 file changed, 13 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Hansen Nov. 6, 2012, 11:03 p.m. UTC | #1
On 11/06/2012 11:52 AM, Srivatsa S. Bhat wrote:
> But of course, memory regions are sub-divisions *within* a node, so it makes
> sense to keep the data-structures in the node's struct pglist_data. (Thus
> this placement makes memory regions parallel to zones in that node).

I think it's pretty silly to create *ANOTHER* subdivision of memory
separate from sparsemem.  One that doesn't handle large amounts of
memory or scale with memory hotplug.  As it stands, you can only support
256*512MB=128GB of address space, which seems pretty puny.
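
[Illustration only: the arithmetic above, spelled out. MAX_NR_REGIONS comes
from the patch; the 512MB region size is taken from this comment and is not
defined anywhere in this patch, so REGION_SIZE_BYTES and the helper name
are assumptions made for the sketch.]

```c
#include <stdint.h>

#define MAX_NR_REGIONS		256
/* Assumed region size: 512MB, as quoted in the review comment. */
#define REGION_SIZE_BYTES	(512ULL << 20)

/* Largest address range a fixed node_regions[] array could describe. */
static uint64_t max_region_coverage_bytes(void)
{
	return MAX_NR_REGIONS * REGION_SIZE_BYTES;
}
```

With these values the function returns 2^37 bytes, i.e. 128GB.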

This node_regions[]:

> @@ -687,6 +698,8 @@ typedef struct pglist_data {
>  	struct zone node_zones[MAX_NR_ZONES];
>  	struct zonelist node_zonelists[MAX_ZONELISTS];
>  	int nr_zones;
> +	struct node_mem_region node_regions[MAX_NR_REGIONS];
> +	int nr_node_regions;
>  #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
>  	struct page *node_mem_map;
>  #ifdef CONFIG_MEMCG

looks like it's indexed the same way regardless of which node it is in.
 In other words, if there are two nodes, at least half of it is wasted,
and 3/4 if there are four nodes.  That seems a bit suboptimal.
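
[Illustration only: a rough model of the waste described above. It assumes
that region indices are global and that regions are split evenly across
nodes, so each node can populate at most MAX_NR_REGIONS / nr_nodes slots of
its own array. The helper name is hypothetical.]

```c
#define MAX_NR_REGIONS	256

/*
 * Slots in one node's node_regions[] that can never be populated when
 * the index space is shared by nr_nodes nodes of equal size.
 * Returns -1 for an invalid node count.
 */
static int unused_slots_per_node(int nr_nodes)
{
	if (nr_nodes <= 0)
		return -1;
	return MAX_NR_REGIONS - MAX_NR_REGIONS / nr_nodes;
}
```

Under that model two nodes leave 128 of 256 slots unused per node (half),
and four nodes leave 192 (three quarters).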

Could you remind us of the logic for leaving sparsemem out of the
equation here?

Srivatsa S. Bhat Nov. 7, 2012, 8:12 p.m. UTC | #2
On 11/07/2012 04:33 AM, Dave Hansen wrote:
> On 11/06/2012 11:52 AM, Srivatsa S. Bhat wrote:
>> But of course, memory regions are sub-divisions *within* a node, so it makes
>> sense to keep the data-structures in the node's struct pglist_data. (Thus
>> this placement makes memory regions parallel to zones in that node).
> 
> I think it's pretty silly to create *ANOTHER* subdivision of memory
> separate from sparsemem.  One that doesn't handle large amounts of
> memory or scale with memory hotplug.  As it stands, you can only support
> 256*512MB=128GB of address space, which seems pretty puny.
> 
> This node_regions[]:
> 
>> @@ -687,6 +698,8 @@ typedef struct pglist_data {
>>  	struct zone node_zones[MAX_NR_ZONES];
>>  	struct zonelist node_zonelists[MAX_ZONELISTS];
>>  	int nr_zones;
>> +	struct node_mem_region node_regions[MAX_NR_REGIONS];
>> +	int nr_node_regions;
>>  #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
>>  	struct page *node_mem_map;
>>  #ifdef CONFIG_MEMCG
> 
> looks like it's indexed the same way regardless of which node it is in.
>  In other words, if there are two nodes, at least half of it is wasted,
> and 3/4 if there are four nodes.  That seems a bit suboptimal.
> 

You're right, I have not addressed that problem in this initial RFC. Thanks
for pointing it out! Going forward, we can surely optimize the way we deal
with memory regions on NUMA systems, using some of the sparsemem techniques.

> Could you remind us of the logic for leaving sparsemem out of the
> equation here?
> 

Nothing; it's just that in this first RFC I was more focused on getting
the overall design right, in terms of having an acceptable way of tracking
pages belonging to different regions within the page allocator (freelists)
and using it to influence page allocation decisions. And also to compare
the merits of this approach over the previous "Hierarchy" design, in a broad
("big picture") sense.

I'll add the point you raised to my todo list and address it in
subsequent versions of the patchset.

Thank you very much for the quick feedback!
 
Regards,
Srivatsa S. Bhat

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 50aaca8..bb7c3ef 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -80,6 +80,8 @@  static inline int get_pageblock_migratetype(struct page *page)
 	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
 }
 
+#define MAX_NR_REGIONS	256
+
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
@@ -328,6 +330,15 @@  enum zone_type {
 #error ZONES_SHIFT -- too many zones configured adjust calculation
 #endif
 
+struct node_mem_region {
+	unsigned long start_pfn;
+	unsigned long present_pages;
+	unsigned long spanned_pages;
+	int idx;
+	int node;
+	struct pglist_data *pgdat;
+};
+
 struct zone {
 	/* Fields commonly accessed by the page allocator */
 
@@ -687,6 +698,8 @@  typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
 	struct zonelist node_zonelists[MAX_ZONELISTS];
 	int nr_zones;
+	struct node_mem_region node_regions[MAX_NR_REGIONS];
+	int nr_node_regions;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
 #ifdef CONFIG_MEMCG