diff mbox

x86/pci: make pci_mem_start to be aligned only -v4

Message ID 49E99054.6050208@kernel.org (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Yinghai Lu April 18, 2009, 8:33 a.m. UTC
Linus Torvalds wrote:
> 
> On Thu, 16 Apr 2009, Yinghai Lu wrote:
>> please check.
>>
>> [PATCH] x86/pci: make pci_mem_start to be aligned only -v4
> 
> I like the approach. That said, I think that rather than do the "modify 
> the e820 array" thing, why not just do it in the in the resource tree, and 
> do it at "e820_reserve_resources_late()" time?
> 
> IOW, something like this.
> 
> TOTALLY UNTESTED! The point is to take all RAM resources we haev, and 
> _after_ we've added all the resources we've seen in the E820 tree, we then 
> _also_ try to add fake reserved entries for any "round up to X" at the end 
> of the RAM resources.
> 
> NOTE! I really didn't want to use "reserve_region_with_split()". I didn't 
> want to recurse into any conflicting resources, I really wanted to just do 
> the other failure cases. 
> 
> THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind 
> of thing. That includes the rough draft of how much to round things up to 
> based on where the end of RAM region is etc. I'm really throwing this out 
> more as a "wouldn't this be a readable way to handle any missing reserved 
> entries" kind of thing..
> 
> 			Linus
> 
> ---
>  arch/x86/kernel/e820.c |   34 ++++++++++++++++++++++++++++++++++
>  1 files changed, 34 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index ef2c356..e8b8d33 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
>  	}
>  }
>  
> +/* How much should we pad RAM ending depending on where it is? */
> +static unsigned long ram_alignment(resource_size_t pos)
> +{
> +	unsigned long mb = pos >> 20;
> +
> +	/* To 64kB in the first megabyte */
> +	if (!mb)
> +		return 64*1024;
> +
> +	/* To 1MB in the first 16MB */
> +	if (mb < 16)
> +		return 1024*1024;
> +
> +	/* To 32MB for anything above that */
> +	return 32*1024*1024;
> +}
> +
>  void __init e820_reserve_resources_late(void)
>  {
>  	int i;
> @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void)
>  			insert_resource_expand_to_fit(&iomem_resource, res);
>  		res++;
>  	}
> +
> +	/*
> +	 * Try to bump up RAM regions to reasonable boundaries to
> +	 * avoid stolen RAM
> +	 */
> +	for (i = 0; i < e820.nr_map; i++) {
> +		struct e820entry *entry = &e820_saved.map[i];
> +		resource_size_t start, end;
> +
> +		if (entry->type != E820_RAM)
> +			continue;
> +		start = entry->addr + entry->size;
> +		end = round_up(start, ram_alignment(start));
> +		if (start == end)
> +			continue;
> +		reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
> +	}
>  }
>  
>  char *__init default_machine_specific_memory_setup(void)

except need to change 
> +		reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
==> > +		reserve_region_with_split(&iomem_resource, start, end - 1, "RAM buffer");

it will make sure dynmical allocating code will not use those range.

and could make e820_setup_gap much simple.

---
 arch/x86/kernel/e820.c |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)


Ingo, can you put those two patches in tip?

YH
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ingo Molnar April 18, 2009, 9:22 a.m. UTC | #1
* Yinghai Lu <yinghai@kernel.org> wrote:

> Linus Torvalds wrote:
> > 
> > On Thu, 16 Apr 2009, Yinghai Lu wrote:
> >> please check.
> >>
> >> [PATCH] x86/pci: make pci_mem_start to be aligned only -v4
> > 
> > I like the approach. That said, I think that rather than do the "modify 
> > the e820 array" thing, why not just do it in the in the resource tree, and 
> > do it at "e820_reserve_resources_late()" time?
> > 
> > IOW, something like this.
> > 
> > TOTALLY UNTESTED! The point is to take all RAM resources we haev, and 
> > _after_ we've added all the resources we've seen in the E820 tree, we then 
> > _also_ try to add fake reserved entries for any "round up to X" at the end 
> > of the RAM resources.
> > 
> > NOTE! I really didn't want to use "reserve_region_with_split()". I didn't 
> > want to recurse into any conflicting resources, I really wanted to just do 
> > the other failure cases. 
> > 
> > THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind 
> > of thing. That includes the rough draft of how much to round things up to 
> > based on where the end of RAM region is etc. I'm really throwing this out 
> > more as a "wouldn't this be a readable way to handle any missing reserved 
> > entries" kind of thing..
> > 
> > 			Linus
> > 
> > ---
> >  arch/x86/kernel/e820.c |   34 ++++++++++++++++++++++++++++++++++
> >  1 files changed, 34 insertions(+), 0 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index ef2c356..e8b8d33 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
> >  	}
> >  }
> >  
> > +/* How much should we pad RAM ending depending on where it is? */
> > +static unsigned long ram_alignment(resource_size_t pos)
> > +{
> > +	unsigned long mb = pos >> 20;
> > +
> > +	/* To 64kB in the first megabyte */
> > +	if (!mb)
> > +		return 64*1024;
> > +
> > +	/* To 1MB in the first 16MB */
> > +	if (mb < 16)
> > +		return 1024*1024;
> > +
> > +	/* To 32MB for anything above that */
> > +	return 32*1024*1024;
> > +}
> > +
> >  void __init e820_reserve_resources_late(void)
> >  {
> >  	int i;
> > @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void)
> >  			insert_resource_expand_to_fit(&iomem_resource, res);
> >  		res++;
> >  	}
> > +
> > +	/*
> > +	 * Try to bump up RAM regions to reasonable boundaries to
> > +	 * avoid stolen RAM
> > +	 */
> > +	for (i = 0; i < e820.nr_map; i++) {
> > +		struct e820entry *entry = &e820_saved.map[i];
> > +		resource_size_t start, end;
> > +
> > +		if (entry->type != E820_RAM)
> > +			continue;
> > +		start = entry->addr + entry->size;
> > +		end = round_up(start, ram_alignment(start));
> > +		if (start == end)
> > +			continue;
> > +		reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
> > +	}
> >  }
> >  
> >  char *__init default_machine_specific_memory_setup(void)
> 
> except need to change 
> > +		reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
> ==> > +		reserve_region_with_split(&iomem_resource, start, end - 1, "RAM buffer");
> 
> it will make sure dynmical allocating code will not use those range.
> 
> and could make e820_setup_gap much simple.
> 
> ---
>  arch/x86/kernel/e820.c |   10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> Index: linux-2.6/arch/x86/kernel/e820.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/e820.c
> +++ linux-2.6/arch/x86/kernel/e820.c
> @@ -635,14 +635,12 @@ __init void e820_setup_gap(void)
>  #endif
>  
>  	/*
> -	 * See how much we want to round up: start off with
> -	 * rounding to the next 1MB area.
> +	 * e820_reserve_resources_late will protect stolen RAM
> +	 * so just round it to 1M
>  	 */
>  	round = 0x100000;
> -	while ((gapsize >> 4) > round)
> -		round += round;
> -	/* Fun with two's complement */
> -	pci_mem_start = (gapstart + round) & -round;
> +
> +	pci_mem_start = roundup(gapstart, round);
>  
>  	printk(KERN_INFO
>  	       "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
> 
> Ingo, can you put those two patches in tip?

I think the point would be to explore the possibility to have no 
'gap' logic at all - we should extend the e820 table with Linus's 
scheme to add 'RAM buffer' entries.

That way, if you search for a sufficient size hole, it will be 
correctly aligned straight away - with no rounding/gap logic at all. 
(Maybe add a debug warning to catch when this is not the case.)

Am i missing something?

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu April 18, 2009, 5:07 p.m. UTC | #2
Ingo Molnar wrote:
> * Yinghai Lu <yinghai@kernel.org> wrote:
> 
>> Linus Torvalds wrote:
>>> On Thu, 16 Apr 2009, Yinghai Lu wrote:
>>>> please check.
>>>>
>>>> [PATCH] x86/pci: make pci_mem_start to be aligned only -v4
>>> I like the approach. That said, I think that rather than do the "modify 
>>> the e820 array" thing, why not just do it in the in the resource tree, and 
>>> do it at "e820_reserve_resources_late()" time?
>>>
>>> IOW, something like this.
>>>
>>> TOTALLY UNTESTED! The point is to take all RAM resources we haev, and 
>>> _after_ we've added all the resources we've seen in the E820 tree, we then 
>>> _also_ try to add fake reserved entries for any "round up to X" at the end 
>>> of the RAM resources.
>>>
>>> NOTE! I really didn't want to use "reserve_region_with_split()". I didn't 
>>> want to recurse into any conflicting resources, I really wanted to just do 
>>> the other failure cases. 
>>>
>>> THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind 
>>> of thing. That includes the rough draft of how much to round things up to 
>>> based on where the end of RAM region is etc. I'm really throwing this out 
>>> more as a "wouldn't this be a readable way to handle any missing reserved 
>>> entries" kind of thing..
>>>
>>> 			Linus
>>>
>>> ---
>>>  arch/x86/kernel/e820.c |   34 ++++++++++++++++++++++++++++++++++
>>>  1 files changed, 34 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
>>> index ef2c356..e8b8d33 100644
>>> --- a/arch/x86/kernel/e820.c
>>> +++ b/arch/x86/kernel/e820.c
>>> @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
>>>  	}
>>>  }
>>>  
>>> +/* How much should we pad RAM ending depending on where it is? */
>>> +static unsigned long ram_alignment(resource_size_t pos)
>>> +{
>>> +	unsigned long mb = pos >> 20;
>>> +
>>> +	/* To 64kB in the first megabyte */
>>> +	if (!mb)
>>> +		return 64*1024;
>>> +
>>> +	/* To 1MB in the first 16MB */
>>> +	if (mb < 16)
>>> +		return 1024*1024;
>>> +
>>> +	/* To 32MB for anything above that */
>>> +	return 32*1024*1024;
>>> +}
>>> +
>>>  void __init e820_reserve_resources_late(void)
>>>  {
>>>  	int i;
>>> @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void)
>>>  			insert_resource_expand_to_fit(&iomem_resource, res);
>>>  		res++;
>>>  	}
>>> +
>>> +	/*
>>> +	 * Try to bump up RAM regions to reasonable boundaries to
>>> +	 * avoid stolen RAM
>>> +	 */
>>> +	for (i = 0; i < e820.nr_map; i++) {
>>> +		struct e820entry *entry = &e820_saved.map[i];
>>> +		resource_size_t start, end;
>>> +
>>> +		if (entry->type != E820_RAM)
>>> +			continue;
>>> +		start = entry->addr + entry->size;
>>> +		end = round_up(start, ram_alignment(start));
>>> +		if (start == end)
>>> +			continue;
>>> +		reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
>>> +	}
>>>  }
>>>  
>>>  char *__init default_machine_specific_memory_setup(void)
>> except need to change 
>>> +		reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
>> ==> > +		reserve_region_with_split(&iomem_resource, start, end - 1, "RAM buffer");
>>
>> it will make sure dynmical allocating code will not use those range.
>>
>> and could make e820_setup_gap much simple.
>>
>> ---
>>  arch/x86/kernel/e820.c |   10 ++++------
>>  1 file changed, 4 insertions(+), 6 deletions(-)
>>
>> Index: linux-2.6/arch/x86/kernel/e820.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/kernel/e820.c
>> +++ linux-2.6/arch/x86/kernel/e820.c
>> @@ -635,14 +635,12 @@ __init void e820_setup_gap(void)
>>  #endif
>>  
>>  	/*
>> -	 * See how much we want to round up: start off with
>> -	 * rounding to the next 1MB area.
>> +	 * e820_reserve_resources_late will protect stolen RAM
>> +	 * so just round it to 1M
>>  	 */
>>  	round = 0x100000;
>> -	while ((gapsize >> 4) > round)
>> -		round += round;
>> -	/* Fun with two's complement */
>> -	pci_mem_start = (gapstart + round) & -round;
>> +
>> +	pci_mem_start = roundup(gapstart, round);
>>  
>>  	printk(KERN_INFO
>>  	       "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
>>
>> Ingo, can you put those two patches in tip?
> 
> I think the point would be to explore the possibility to have no 
> 'gap' logic at all - we should extend the e820 table with Linus's 
> scheme to add 'RAM buffer' entries.
> 
so you prefer the old one aka the -v4, and add new entry type for RAM Buffer?

YH
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds April 18, 2009, 6:57 p.m. UTC | #3
On Sat, 18 Apr 2009, Ingo Molnar wrote:
> 
> Am i missing something?

We also try to avoid random motherboard resources etc that aren't reserved 
or documented by the BIOS. It's better to go into big holes. It's also 
better to try to keep as close to the old (tested) behavior.

Now, admittedly those undocumented resources are _much_ more common in IO 
space, but still. They're _very_ common. Om my modern Nehalem thing with 
an Intel BIOS (supposedly "good" and not from some random manufacturer), I 
have, for example:

[   26.533771] pci 0000:00:1f.0: ICH7 LPC Generic IO decode 2 PIO at 0810 (mask 007f)

byt that one isn't covered by any PnP range or anythign else.

[ Now, it's possible that it's bogus: "0x810" has a bit set in the same 
  bits that cover the mask, and I don't know if the mask is a "ignore 
  these bits" (and the range would thus match all of 0x0800-0x087f) or if 
  the mast is a "port & ~mask == base" in which case nothing would ever 
  match.

  But I _think_ the BIOS literally set up something to answer int he 
  0x08?? range, and didn't document it anywhere. The same can be true of 
  MMIO too, and so we should try to avoid using random memory areas if we 
  can ]

			Linus

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar April 18, 2009, 7:14 p.m. UTC | #4
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 18 Apr 2009, Ingo Molnar wrote:
> > 
> > Am i missing something?
> 
> We also try to avoid random motherboard resources etc that aren't 
> reserved or documented by the BIOS. It's better to go into big 
> holes. It's also better to try to keep as close to the old 
> (tested) behavior.

Yeah - i'm not suggesting any change in behavior, nor am i 
suggesting any risky behavior. The current code seems to work quite 
well.

I'm just suggesting (maybe foolishly) that instead of having any 
gap-rounding logic at all, add artificial entries to the e820 map to 
'extend' and round up any odd ending entries.

I.e. explicitly manage all the 'hole' space to be nicely rounded and 
to be far away from any T-Seg or other sekrit motherboard resource 
danger area.

We'd do this after PCI static allocations (so we dont ever stomp on 
real, known resources) but before PCI dynamic allocations.

The e820 printout would look literally like this:

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)       0.639 MB RAM
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)     0.001 MB
                                                [ hole ]       0.250 MB
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)     0.125 MB
 BIOS-e820: 0000000000100000 - 000000003ed94000 (usable)    1004.5   MB RAM
 BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS)     0.7   MB
 BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable)      16.3   MB RAM
 BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS)     0.3   MB
 BIOS-e820: 000000003fee9000 - 000000003feed000 (usable)       0.15  MB RAM
 BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data     0.07  MB
 BIOS-e820: 000000003feff000 - 000000003ff00000 (usable)       0.004 MB RAM
 BIOS-e820: 000000003ff00000 - 0000000040000000 (guard)        1.0   MB
                                               	[ hole ]    3072.0   MB

The '(guard)' entry at the end i added above.

This way we intentionally create a 'free physical address space' 
hole space that is the same as the rounding logic. No rounding 
needed anywhere - as all the remaining address space is well-rounded 
already. Plus we'd also _see_ all our rounding logic by looking at 
the '(guard)' entries.

Or maybe there's some aspect of gap-rounding that cannot be 
expressed in such a static way?

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ivan Kokshaysky April 18, 2009, 8:13 p.m. UTC | #5
On Sat, Apr 18, 2009 at 09:14:25PM +0200, Ingo Molnar wrote:
> This way we intentionally create a 'free physical address space' 
> hole space that is the same as the rounding logic. No rounding 
> needed anywhere - as all the remaining address space is well-rounded 
> already. Plus we'd also _see_ all our rounding logic by looking at 
> the '(guard)' entries.
> 
> Or maybe there's some aspect of gap-rounding that cannot be 
> expressed in such a static way?

My gut feeling is that you guys do overcomplicate a simple issue
which can be fixed with a one-liner like this:

	pci_mem_start = pci_mem_start < 0xc0000000 ? : 0xc0000000;

This 0xc0000000 (3G) seems to be a pretty fundamental thing for certain
32-bit OS. ;-)

Ivan.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-2.6/arch/x86/kernel/e820.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/e820.c
+++ linux-2.6/arch/x86/kernel/e820.c
@@ -635,14 +635,12 @@  __init void e820_setup_gap(void)
 #endif
 
 	/*
-	 * See how much we want to round up: start off with
-	 * rounding to the next 1MB area.
+	 * e820_reserve_resources_late will protect stolen RAM
+	 * so just round it to 1M
 	 */
 	round = 0x100000;
-	while ((gapsize >> 4) > round)
-		round += round;
-	/* Fun with two's complement */
-	pci_mem_start = (gapstart + round) & -round;
+
+	pci_mem_start = roundup(gapstart, round);
 
 	printk(KERN_INFO
 	       "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",