SQUASHME: Fixes to e820 handling of pmem
diff mbox

Message ID 551BFFD2.20608@plexistor.com
State New, archived
Headers show

Commit Message

Boaz Harrosh April 1, 2015, 2:25 p.m. UTC
Finally with these fixes I'm able to define memmap=!
regions in NUMA machines. Any combination cross or
not cross NUMA boundary.

And not only the memmap=! regions had problems also
the real type-12 NvDIMMs had the same NUMA problems.
Now it all works.

Also
I have kept the "don't merge PRAM" regions for ease
of emulated NUMA setups.

Also I have reverted the change Ch did to
e820_mark_nosave_regions. From comment above the function
and if I'm reading register_nosave_region() correctly,
We certainly do not want the system to try and save any
pmem to swap or hibernate. (Actually it will be the
opposite right). Can we actually define swap on a /dev/pmemX ? ;-)

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 Documentation/kernel-parameters.txt |  6 ++++++
 arch/x86/kernel/e820.c              | 20 +++++++++-----------
 2 files changed, 15 insertions(+), 11 deletions(-)

Comments

hch@lst.de April 2, 2015, 9:30 a.m. UTC | #1
On Wed, Apr 01, 2015 at 05:25:22PM +0300, Boaz Harrosh wrote:
>  		pfn = PFN_DOWN(ei->addr + ei->size);
>  
> -		switch (ei->type) {
> -		case E820_RAM:
> -		case E820_PRAM:
> -		case E820_RESERVED_KERN:
> -			break;
> -		default:
> +		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
>  			register_nosave_region(PFN_UP(ei->addr), pfn);
> -		}

I guess this makes sense - if the content is persistent already we don't need
to save it.

> -		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
> -			if (e820.map[i].type != E820_PRAM)
> -				res->flags |= IORESOURCE_BUSY;
> +		if (((e820.map[i].type != E820_RESERVED) &&
> +		     (e820.map[i].type != E820_PRAM)) ||
> +		     res->start < (1ULL<<20)) {

So now we also trigger for PRAM regions under 1ULL<<20, was that the
intentional change?  Honestly I don't really understand this 1ULL<<20
magic here even for the existing case.  Guess this is magic from the
old ISA PC days?  

> +			res->flags |= IORESOURCE_BUSY;

Guess this is the real change, and I'd love to understand why this
makes a difference for you.  IORESOURCE_BUSY is checked almost never,
and is intented to mean it's a driver mapping.
Ingo Molnar April 2, 2015, 9:37 a.m. UTC | #2
* Christoph Hellwig <hch@lst.de> wrote:

> On Wed, Apr 01, 2015 at 05:25:22PM +0300, Boaz Harrosh wrote:
> >  		pfn = PFN_DOWN(ei->addr + ei->size);
> >  
> > -		switch (ei->type) {
> > -		case E820_RAM:
> > -		case E820_PRAM:
> > -		case E820_RESERVED_KERN:
> > -			break;
> > -		default:
> > +		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
> >  			register_nosave_region(PFN_UP(ei->addr), pfn);
> > -		}
> 
> I guess this makes sense - if the content is persistent already we don't need
> to save it.
> 
> > -		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
> > -			if (e820.map[i].type != E820_PRAM)
> > -				res->flags |= IORESOURCE_BUSY;
> > +		if (((e820.map[i].type != E820_RESERVED) &&
> > +		     (e820.map[i].type != E820_PRAM)) ||
> > +		     res->start < (1ULL<<20)) {
> 
> So now we also trigger for PRAM regions under 1ULL<<20, was that the
> intentional change?  Honestly I don't really understand this 1ULL<<20
> magic here even for the existing case.  Guess this is magic from the
> old ISA PC days?  
> 
> > +			res->flags |= IORESOURCE_BUSY;
> 
> Guess this is the real change, and I'd love to understand why this 
> makes a difference for you.  IORESOURCE_BUSY is checked almost 
> never, and is intented to mean it's a driver mapping.

So assuming this works on your test setup I'm inclined to squash 
Boaz's fixes into the original patch, assuming you see no outright bug 
in them. Anything else can be done as delta improvements.

Agreed?

Thanks,

	Ingo
hch@lst.de April 2, 2015, 9:40 a.m. UTC | #3
On Thu, Apr 02, 2015 at 11:37:40AM +0200, Ingo Molnar wrote:
> So assuming this works on your test setup I'm inclined to squash 
> Boaz's fixes into the original patch, assuming you see no outright bug 
> in them. Anything else can be done as delta improvements.

It looks sensible, but I'd really like to understand the changes a bit
better.  In the meantime I'll test them on my setup.
hch@lst.de April 2, 2015, 11:18 a.m. UTC | #4
On Thu, Apr 02, 2015 at 11:37:40AM +0200, Ingo Molnar wrote:
> So assuming this works on your test setup

Boaz's changes work fine here.
Boaz Harrosh April 2, 2015, 11:20 a.m. UTC | #5
On 04/02/2015 12:30 PM, Christoph Hellwig wrote:
> On Wed, Apr 01, 2015 at 05:25:22PM +0300, Boaz Harrosh wrote:
>>  		pfn = PFN_DOWN(ei->addr + ei->size);
>>  
>> -		switch (ei->type) {
>> -		case E820_RAM:
>> -		case E820_PRAM:
>> -		case E820_RESERVED_KERN:
>> -			break;
>> -		default:
>> +		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
>>  			register_nosave_region(PFN_UP(ei->addr), pfn);
>> -		}
> 
> I guess this makes sense - if the content is persistent already we don't need
> to save it.
> 

Yes

>> -		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
>> -			if (e820.map[i].type != E820_PRAM)
>> -				res->flags |= IORESOURCE_BUSY;
>> +		if (((e820.map[i].type != E820_RESERVED) &&
>> +		     (e820.map[i].type != E820_PRAM)) ||
>> +		     res->start < (1ULL<<20)) {
> 
> So now we also trigger for PRAM regions under 1ULL<<20, was that the
> intentional change?  Honestly I don't really understand this 1ULL<<20
> magic here even for the existing case.  Guess this is magic from the
> old ISA PC days?  
> 

OK so by now I have had a chance to test all cases and figure out
what happened. I was running on your V1 and then V2. And had huge
problems with NUMA. The thing that actually fixed everything and
brought the system back to life, was already in your V3.

It was the remove of reserve_pmem() and the call to memblock_reserve()
and init_memory_mapping() from within. It was making a mess.

So your V3 was already running nice. But I already fixed everything
on my side by the time you sent V3, and what I sent here is the
diff from what I had and your V3.

But these are all good fixes still. (Though not fatal as V2 was)

3 small fixes here:
* Adding back the memmap=! now that everything works perfectly
  as expected.

* The one above that fixes register_nosave_region

and ...
>> +			res->flags |= IORESOURCE_BUSY;
> 
> Guess this is the real change, and I'd love to understand why this
> makes a difference for you.  IORESOURCE_BUSY is checked almost never,
> and is intented to mean it's a driver mapping.

This here is a very minor thing. I have lots of experience with this
code and with the internals of insert_resource() from my old e820.c
fixes (Which are BTW still relevant but no longer needed for any
current platform)

So what happens here is just the adding of the IORESOURCE_BUSY flag.
Actually all these resources are already in the resource list and this
code is just changing the flag. So if you are not changing the flag there
is just no point in calling insert_resource(). It will change nothing.

From what I understand from the history of this file and from prints
that I put in this file and at the resource manager. The logic is that
it wants to make all E820_XXX busy so to not let any driver try and claim them.
And Also the code does not want to allow any kind of e820 resource below the 1M
be allowed for drivers, reserved or otherwise.

Any E820_RESERVED regions it allows for driver use by not setting IORESOURCE_BUSY.

Your code and mine are effectively the same only that I optimize out the call to
insert_resource() since the flags were not changed.

Testing status:
We had some birth problems with the new system and the APIs that changed so the
testing rig broke half through the night. But we have wrinkled out the last
problems and the machines are pumping away full steam, NUMA and everything.
So far it looks good. I will very soon now, Send you a tested-by: That
confirms these patches are just as good as what we had until now out-of-tree.
(I'm running with my page-struct patches on top of your pmem driver. Will publish
 trees soon)

Thank you for your patience
Boaz

Patch
diff mbox

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a6..c87122d 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1965,6 +1965,12 @@  bytes respectively. Such letter suffixes can also be entirely omitted.
 			         or
 			         memmap=0x10000$0x18690000
 
+	memmap=nn[KMG]!ss[KMG]
+			[KNL,X86] Mark specific memory as protected.
+			Region of memory to be used, from ss to ss+nn.
+			The memory region may be marked as e820 type 12 (0xc)
+			and is NVDIMM or ADR memory.
+
 	memory_corruption_check=0/1 [X86]
 			Some BIOSes seem to corrupt the first 64k of
 			memory when doing things like suspend/resume.
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index e26ca56..3572a22 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -346,7 +346,7 @@  int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map,
 		 * continue building up new bios map based on this
 		 * information
 		 */
-		if (current_type != last_type)	{
+		if (current_type != last_type || current_type == E820_PRAM) {
 			if (last_type != 0)	 {
 				new_bios[new_bios_entry].size =
 					change_point[chgidx]->addr - last_addr;
@@ -692,14 +692,8 @@  void __init e820_mark_nosave_regions(unsigned long limit_pfn)
 
 		pfn = PFN_DOWN(ei->addr + ei->size);
 
-		switch (ei->type) {
-		case E820_RAM:
-		case E820_PRAM:
-		case E820_RESERVED_KERN:
-			break;
-		default:
+		if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
 			register_nosave_region(PFN_UP(ei->addr), pfn);
-		}
 
 		if (pfn >= limit_pfn)
 			break;
@@ -880,6 +874,9 @@  static int __init parse_memmap_one(char *p)
 	} else if (*p == '$') {
 		start_at = memparse(p+1, &p);
 		e820_add_region(start_at, mem_size, E820_RESERVED);
+	} else if (*p == '!') {
+		start_at = memparse(p+1, &p);
+		e820_add_region(start_at, mem_size, E820_PRAM);
 	} else
 		e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
 
@@ -955,9 +952,10 @@  void __init e820_reserve_resources(void)
 		 * pci device BAR resource and insert them later in
 		 * pcibios_resource_survey()
 		 */
-		if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
-			if (e820.map[i].type != E820_PRAM)
-				res->flags |= IORESOURCE_BUSY;
+		if (((e820.map[i].type != E820_RESERVED) &&
+		     (e820.map[i].type != E820_PRAM)) ||
+		     res->start < (1ULL<<20)) {
+			res->flags |= IORESOURCE_BUSY;
 			insert_resource(&iomem_resource, res);
 		}
 		res++;