
[01/10] mm: Introduce the memory regions data structure

Message ID 1306499498-14263-2-git-send-email-ankita@in.ibm.com (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Ankita Garg May 27, 2011, 12:31 p.m. UTC
A memory region data structure is created under each NUMA node. Each NUMA node
can have multiple memory regions, depending upon the platform configuration for
power management. Each memory region contains zones, which are the entities
from which memory is allocated by the buddy allocator.

                 -------------
                 | pg_data_t |
                 -------------
                     |   |
                ------   -------
                v              v
        ----------------    ----------------
        | mem_region_t |    | mem_region_t |
        ----------------    ----------------    -------------
               |                    |...........| zone0 | ....
               v                                -------------
           -----------------------------
           | zone0 | zone1 | zone2 | ..|
           -----------------------------

Each memory region contains a zone array for the zones belonging to that region,
in addition to other fields like the node id, the index of the region in the
node, the start pfn of the pages in that region, and the number of pages the
region spans. The zone array inside the regions is statically allocated at this
point.

ToDo:
Since the number of regions actually present on the system might be much
smaller than the maximum allowed, dynamic bootmem allocation could be used to
save memory.

Signed-off-by: Ankita Garg <ankita@in.ibm.com>
---
 include/linux/mmzone.h |   25 ++++++++++++++++++++++++-
 1 files changed, 24 insertions(+), 1 deletions(-)

Comments

Dave Hansen May 27, 2011, 3:30 p.m. UTC | #1
On Fri, 2011-05-27 at 18:01 +0530, Ankita Garg wrote:
> +typedef struct mem_region_list_data {
> +       struct zone zones[MAX_NR_ZONES];
> +       int nr_zones;
> +
> +       int node;
> +       int region;
> +
> +       unsigned long start_pfn;
> +       unsigned long spanned_pages;
> +} mem_region_t;
> +
> +#define MAX_NR_REGIONS    16 

Don't do the foo_t thing.  It's out of style and the pg_data_t is a
dinosaur.

I'm a bit surprised how little discussion of this there is in the patch
descriptions.  Why did you choose this structure?  What are the
downsides of doing it this way?  This effectively breaks up the zone's
LRU in to MAX_NR_REGIONS LRUs.  What effects does that have?

How big _is_ a 'struct zone' these days?  This patch will increase their
effective size by 16x.

Since one distro kernel basically gets run on *EVERYTHING*, what will
MAX_NR_REGIONS be in practice?  How many regions are there on the
largest systems that will need this?  We're going to be doing many
linear searches and iterations over it, so it's pretty darn important to
know.  What does this do to lmbench numbers sensitive to page
allocations?

-- Dave
Vaidyanathan Srinivasan May 27, 2011, 6:20 p.m. UTC | #2
* Dave Hansen <dave@linux.vnet.ibm.com> [2011-05-27 08:30:03]:

> On Fri, 2011-05-27 at 18:01 +0530, Ankita Garg wrote:
> > +typedef struct mem_region_list_data {
> > +       struct zone zones[MAX_NR_ZONES];
> > +       int nr_zones;
> > +
> > +       int node;
> > +       int region;
> > +
> > +       unsigned long start_pfn;
> > +       unsigned long spanned_pages;
> > +} mem_region_t;
> > +
> > +#define MAX_NR_REGIONS    16 
> 
> Don't do the foo_t thing.  It's out of style and the pg_data_t is a
> dinosaur.
> 
> I'm a bit surprised how little discussion of this there is in the patch
> descriptions.  Why did you choose this structure?  What are the
> downsides of doing it this way?  This effectively breaks up the zone's
> LRU in to MAX_NR_REGIONS LRUs.  What effects does that have?

This data structure is one of the options, but it definitely has overheads.
One alternative was to use fake NUMA nodes, which has more overhead and
user-visible quirks.

The overhead depends on the number of regions actually defined by the
platform; it may be 2-4 on smaller systems.  This split is what makes
allocation and reclaim work within these boundaries, using the zone's
active and inactive lists on a per-memory-region basis.

An external structure that just captures the boundaries would have less
overhead, but it does not provide enough hooks to influence the zone-level
allocator and reclaim operations.

> How big _is_ a 'struct zone' these days?  This patch will increase their
> effective size by 16x.

Yes, this is not good; we should do a runtime allocation for the exact
number of regions that we need.  This can be optimized later, once we
design the data structure hierarchy with the least overhead for the
purpose.

> Since one distro kernel basically gets run on *EVERYTHING*, what will
> MAX_NR_REGIONS be in practice?  How many regions are there on the
> largest systems that will need this?  We're going to be doing many
> linear searches and iterations over it, so it's pretty darn important to
> know.  What does this do to lmbench numbers sensitive to page
> allocations?

Yep, agreed; we are generally looking at 2-4 regions per node for most
purposes.  Also, regions need not be of equal size: they can be large or
small based on platform characteristics, so that we need not fragment the
zones below the level required.

The overall idea is to have a VM data structure that can capture various
boundaries of memory, and enable the allocation and reclaim logic to
target certain areas based on the boundaries and properties required.
NUMA nodes and pgdat are an example of capturing memory distance.  The
proposed memory regions should capture other, orthogonal properties and
boundaries of memory addresses, similar to zone type.

Thanks for the quick feedback.

--Vaidy

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e56f835..997a474 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -60,6 +60,7 @@  struct free_area {
 };
 
 struct pglist_data;
+struct mem_region_list_data;
 
 /*
  * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
@@ -311,6 +312,7 @@  struct zone {
 	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
 #endif
+	int region;
 	struct per_cpu_pageset __percpu *pageset;
 	/*
 	 * free areas of different sizes
@@ -399,6 +401,8 @@  struct zone {
 	 * Discontig memory support fields.
 	 */
 	struct pglist_data	*zone_pgdat;
+	struct mem_region_list_data	*zone_mem_region;
+
 	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
 	unsigned long		zone_start_pfn;
 
@@ -597,6 +601,19 @@  struct node_active_region {
 extern struct page *mem_map;
 #endif
 
+typedef struct mem_region_list_data {
+	struct zone zones[MAX_NR_ZONES];
+	int nr_zones;
+
+	int node;
+	int region;
+
+	unsigned long start_pfn;
+	unsigned long spanned_pages;
+} mem_region_t;
+
+#define MAX_NR_REGIONS    16
+
 /*
  * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
  * (mostly NUMA machines?) to denote a higher-level memory zone than the
@@ -610,7 +627,10 @@  extern struct page *mem_map;
  */
 struct bootmem_data;
 typedef struct pglist_data {
-	struct zone node_zones[MAX_NR_ZONES];
+/*	The linkage to node_zones is now removed. The new hierarchy introduced
+ *	is pg_data_t -> mem_region -> zones
+ * 	struct zone node_zones[MAX_NR_ZONES];
+ */
 	struct zonelist node_zonelists[MAX_ZONELISTS];
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
@@ -632,6 +652,9 @@  typedef struct pglist_data {
 	 */
 	spinlock_t node_size_lock;
 #endif
+	mem_region_t mem_regions[MAX_NR_REGIONS];
+	int nr_mem_regions;
+
 	unsigned long node_start_pfn;
 	unsigned long node_present_pages; /* total number of physical pages */
 	unsigned long node_spanned_pages; /* total size of physical page