diff mbox

x86/pci: make pci_mem_start to be aligned only -v4

Message ID alpine.LFD.2.00.0904161540590.4042@localhost.localdomain (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Linus Torvalds April 16, 2009, 11:18 p.m. UTC
On Thu, 16 Apr 2009, Yinghai Lu wrote:
>
> please check.
> 
> [PATCH] x86/pci: make pci_mem_start to be aligned only -v4

I like the approach. That said, I think that rather than do the "modify 
the e820 array" thing, why not just do it in the in the resource tree, and 
do it at "e820_reserve_resources_late()" time?

IOW, something like this.

TOTALLY UNTESTED! The point is to take all RAM resources we haev, and 
_after_ we've added all the resources we've seen in the E820 tree, we then 
_also_ try to add fake reserved entries for any "round up to X" at the end 
of the RAM resources.

NOTE! I really didn't want to use "reserve_region_with_split()". I didn't 
want to recurse into any conflicting resources, I really wanted to just do 
the other failure cases. 

THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind 
of thing. That includes the rough draft of how much to round things up to 
based on where the end of RAM region is etc. I'm really throwing this out 
more as a "wouldn't this be a readable way to handle any missing reserved 
entries" kind of thing..

			Linus

---
 arch/x86/kernel/e820.c |   34 ++++++++++++++++++++++++++++++++++
 1 files changed, 34 insertions(+), 0 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Ingo Molnar April 16, 2009, 11:54 p.m. UTC | #1
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 16 Apr 2009, Yinghai Lu wrote:
> >
> > please check.
> > 
> > [PATCH] x86/pci: make pci_mem_start to be aligned only -v4
> 
> I like the approach. That said, I think that rather than do the "modify 
> the e820 array" thing, why not just do it in the in the resource tree, and 
> do it at "e820_reserve_resources_late()" time?
> 
> IOW, something like this.
> 
> TOTALLY UNTESTED! The point is to take all RAM resources we haev, and 
> _after_ we've added all the resources we've seen in the E820 tree, we then 
> _also_ try to add fake reserved entries for any "round up to X" at the end 
> of the RAM resources.
> 
> NOTE! I really didn't want to use "reserve_region_with_split()". I didn't 
> want to recurse into any conflicting resources, I really wanted to just do 
> the other failure cases. 
> 
> THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind 
> of thing. That includes the rough draft of how much to round things up to 
> based on where the end of RAM region is etc. I'm really throwing this out 
> more as a "wouldn't this be a readable way to handle any missing reserved 
> entries" kind of thing..
> 
> 			Linus
> 
> ---
>  arch/x86/kernel/e820.c |   34 ++++++++++++++++++++++++++++++++++
>  1 files changed, 34 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index ef2c356..e8b8d33 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void)
>  	}
>  }
>  
> +/* How much should we pad RAM ending depending on where it is? */
> +static unsigned long ram_alignment(resource_size_t pos)
> +{
> +	unsigned long mb = pos >> 20;
> +
> +	/* To 64kB in the first megabyte */
> +	if (!mb)
> +		return 64*1024;

Could we perhaps round up to 1MB in this case too?

This would mean that we would never dynamically allocate into the 
640k..1MB hole - but even if it's free it's probably a good idea to 
avoid that range - it's usually quite special.

So we'd just have two ganularities for round-up: 1MB and 32MB. Nice, 
predictable scheme.

> +
> +	/* To 1MB in the first 16MB */
> +	if (mb < 16)
> +		return 1024*1024;
> +
> +	/* To 32MB for anything above that */
> +	return 32*1024*1024;
> +}
> +
>  void __init e820_reserve_resources_late(void)
>  {
>  	int i;
> @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void)
>  			insert_resource_expand_to_fit(&iomem_resource, res);
>  		res++;
>  	}
> +
> +	/*
> +	 * Try to bump up RAM regions to reasonable boundaries to
> +	 * avoid stolen RAM
> +	 */
> +	for (i = 0; i < e820.nr_map; i++) {
> +		struct e820entry *entry = &e820_saved.map[i];
> +		resource_size_t start, end;
> +
> +		if (entry->type != E820_RAM)
> +			continue;

Would it make sense to round up everything that is listed in the 
E820 map? Just in case the BIOS is not entirely honest about the 
true extent of that area.

It might even make sense to add a small 'guard' area next to such 
ranges (even if they are aligned well): to prevent the BIOS from 
accidentally overruning into a device BAR we allocate next to it.

> +		start = entry->addr + entry->size;
> +		end = round_up(start, ram_alignment(start));
> +		if (start == end)
> +			continue;

Hm, indeed, the continue is needed - reserve_region_with_split() 
lets zero-sized resources be inserted silently. I'd have missed this 
case. Do zero-sized memory resources have a special role somewhere?

[ Plus i dont see any protection against negative-size resources in 
  kernel/resource.c either. OTOH inserting a negative size resource 
  just locks down the tree for all future resources, so it would be 
  noticed for sure.

  Zero-size is not an issue it appears, it goes in and prevents only 
  the exactly same-position zero-size resource to be inserted.

  But it might sense to silently ignore zero-size resources (if we 
  dont rely on them elsewhere), or to WARN() about them and about 
  negative size resources.
]

> +		reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds April 17, 2009, 12:24 a.m. UTC | #2
On Fri, 17 Apr 2009, Ingo Molnar wrote:
> 
> Could we perhaps round up to 1MB in this case too?

(The below 1MB one). 

I'd argue against it, at least in this incarnation. I can well imagine 
somebody wanting to do resource management in the 640k-1M window, so..

> Would it make sense to round up everything that is listed in the 
> E820 map? Just in case the BIOS is not entirely honest about the 
> true extent of that area.

Well, it would probably work, but on the other hand, when we see 
"E820_RAM", that means that we really _can_ trust that that E820 entry is 
right, since we're going to use it as RAM (and Windows would too), and if 
it wasn't RAM, really bad things would happen.

So E820_RAM is a _lot_ more trustworthy than the other cases. If we're 
rounding up by reasonably large amounts like 32MB or even more, I really 
think we should do so for the things we really know are there, and that we 
really fundamentally know come in big granularities.

The other entries in the e820 map can reasonably be 4kB or something, 
because they are an IO-APIC or whatever. I can't say that I'd feel happy 
putting a guard area around something like that. But RAM? Sure, it can 
come in 384kB chunks (think RAM remapping for the low 1MB area), but it 
doesn't tend to happen when we're talking gigs any more.

> > +		start = entry->addr + entry->size;
> > +		end = round_up(start, ram_alignment(start));
> > +		if (start == end)
> > +			continue;
> 
> Hm, indeed, the continue is needed - reserve_region_with_split() 
> lets zero-sized resources be inserted silently. I'd have missed this 
> case. Do zero-sized memory resources have a special role somewhere?

No. But it wouldn't be a zero-size region, it would be a one-byte sized 
region. It's just that my patch was missing the "-1" from the end that I 
meant to put there:

> > +		reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");

That 'end' there should be 'end-1', and that also explains why "start == 
end" must have a continue.

The 'end' in a resource region is the last byte, not the 'byte after'. 

So there was a small buglet in the patch, but as I mentioned, using 
"reserve_region_with_split()" is really wrong anyway, because we do not 
want to recurse into existing regions, just split _around_ them. So the 
patch was meant as 

			Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar April 17, 2009, 1:16 p.m. UTC | #3
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 17 Apr 2009, Ingo Molnar wrote:
> > 
> > Could we perhaps round up to 1MB in this case too?
> 
> (The below 1MB one).
> 
> I'd argue against it, at least in this incarnation. I can well 
> imagine somebody wanting to do resource management in the 640k-1M 
> window, so..

ok - indeed - if there's some super-small system with limited 
address lines and all physical addresses tightly packed with RAM?

> > Would it make sense to round up everything that is listed in the 
> > E820 map? Just in case the BIOS is not entirely honest about the 
> > true extent of that area.
> 
> Well, it would probably work, but on the other hand, when we see 
> "E820_RAM", that means that we really _can_ trust that that E820 
> entry is right, since we're going to use it as RAM (and Windows 
> would too), and if it wasn't RAM, really bad things would happen.
> 
> So E820_RAM is a _lot_ more trustworthy than the other cases. If 
> we're rounding up by reasonably large amounts like 32MB or even 
> more, I really think we should do so for the things we really know 
> are there, and that we really fundamentally know come in big 
> granularities.
> 
> The other entries in the e820 map can reasonably be 4kB or 
> something, because they are an IO-APIC or whatever. I can't say 
> that I'd feel happy putting a guard area around something like 
> that. But RAM? Sure, it can come in 384kB chunks (think RAM 
> remapping for the low 1MB area), but it doesn't tend to happen 
> when we're talking gigs any more.

One of my systems is a bit weird, with such a checkered RAM map:

BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)       0.639 MB RAM
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)     0.001 MB
                                                [ hole ]       0.250 MB
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)     0.125 MB
 BIOS-e820: 0000000000100000 - 000000003ed94000 (usable)    1004.5   MB RAM
 BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS)     0.7   MB
 BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable)      16.3   MB RAM
 BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS)     0.3   MB
 BIOS-e820: 000000003fee9000 - 000000003feed000 (usable)       0.15  MB RAM
 BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data     0.07  MB
 BIOS-e820: 000000003feff000 - 000000003ff00000 (usable)       0.004 MB RAM
                                                [ hole ]       1.0   MB
                                                [ hole ]    3072.0   MB

On this map, using your scheme, we'd fill up that small 1MB hole up 
to 1GB [mockup]:

 BIOS-e820: 000000003ff00000 - 0000000040000000 (RAM buffer)

I guess that's a good thing not just for robustness: a chipset might 
be faster when DMA or mmio is on some well-isolated physical memory 
range, not too close to real RAM or other devices?

Bits of the low hole:

 00000000-0009fbff : System RAM
 0009fc00-0009ffff : reserved
 000c0000-000dffff : pnp 00:01
 000e0000-000fffff : reserved
 00100000-3ed93fff : System RAM

would still be available to dynamic PCI resources - as the 64K 
rounding would leave it alone.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yinghai Lu April 17, 2009, 9:59 p.m. UTC | #4
while testing linus's patch, found some resource name get corrupted...
even bare tip

cat /proc/iomem
00000000-000973ff : System RAM
00097400-0009ffff : reserved
000a0000-000bffff : PCI Bus #00
000c0000-000cffff : pnp 00:0c
000e0000-000fffff : reserved
00100000-b7f9ffff : System RAM
  00200000-00c67b4b : Kernel code
  00c67b4c-01331edf : Kernel data
  015a5000-01fc9657 : Kernel bss
  20000000-23ffffff : GART
b7fae000-b7faffff : System RAM
b7fb0000-b7fbdfff : ACPI Tables
b7fbe000-b7feffff : ACPI Non-volatile Storage
b7ff0000-b7ffffff : reserved
b8000000-beffffff : PCI Bus #00
bf000000-bfffffff : PCI Bus #80
  bfe80000-bfebffff : pnp 00:0e
  bfef9000-bfef9fff :
    bfef9000-bfef9fff : forcedeth
  bfefa000-bfefa00f :
    bfefa000-bfefa00f : forcedeth
  bfefa400-bfefa4ff :
    bfefa400-bfefa4ff : forcedeth
  bfefa800-bfefa80f :
    bfefa800-bfefa80f : forcedeth
  bfefac00-bfefacff :
    bfefac00-bfefacff : forcedeth
  bfefb000-bfefbfff :
    bfefb000-bfefbfff : forcedeth
  bfefc000-bfefcfff :
    bfefc000-bfefcfff : sata_nv
  bfefd000-bfefdfff :
    bfefd000-bfefdfff : sata_nv
  bfefe000-bfefefff :
    bfefe000-bfefefff : sata_nv
  bfeff000-bfefffff : IOAPIC 1
    bfeff000-bfefffff :
  bff00000-bfffffff : PCI Bus 0000:83
    bff80000-bffbffff : 0000:83:00.0
    bfffec00-bfffecff : 0000:83:00.0
      bfffec00-bfffecff : lpfc
    bffff000-bfffffff : 0000:83:00.0
      bffff000-bfffffff : lpfc
c0000000-dfffffff : PCI Bus #00
  c0000000-d9ffffff : PCI Bus 0000:03
    c0000000-cfffffff : 0000:03:00.0
    d9800000-d9ffffff : 0000:03:00.0
  da6b8c00-da6b8c0f : 0000:00:09.0
    da6b8c00-da6b8c0f : forcedeth
  da6b9000-da6b9fff : 0000:00:09.0
    da6b9000-da6b9fff : forcedeth
  da6ba000-da6bafff : 0000:00:08.0
    da6ba000-da6bafff : forcedeth
  da6bb000-da6bbfff : 0000:00:05.2
    da6bb000-da6bbfff : sata_nv
  da6bc000-da6bcfff : 0000:00:05.1
    da6bc000-da6bcfff : sata_nv
  da6bd000-da6bdfff : 0000:00:05.0
    da6bd000-da6bdfff : sata_nv
  da6be000-da6be0ff : 0000:00:09.0
    da6be000-da6be0ff : forcedeth
  da6be400-da6be40f : 0000:00:08.0
    da6be400-da6be40f : forcedeth
  da6be800-da6be8ff : 0000:00:08.0
    da6be800-da6be8ff : forcedeth
  da6bec00-da6becff : 0000:00:02.1
    da6bec00-da6becff : ehci_hcd
  da6bf000-da6bffff : 0000:00:02.0
    da6bf000-da6bffff : ohci_hcd
  da6c0000-da6fffff : pnp 00:05
  da700000-daffffff : PCI Bus 0000:01
    da7e0000-da7fffff : 0000:01:05.0
    da800000-daffffff : 0000:01:05.0
  db000000-dfbfffff : PCI Bus 0000:02
    db000000-dbffffff : 0000:02:00.3
    dc000000-dcffffff : 0000:02:00.2
    dd000000-ddffffff : 0000:02:00.1
      dd000000-ddffffff : niu
    de000000-deffffff : 0000:02:00.0
      de000000-deffffff : niu
    df700000-df7fffff : 0000:02:00.3
    df800000-df8fffff : 0000:02:00.2
    df900000-df9fffff : 0000:02:00.1
    dfa00000-dfafffff : 0000:02:00.0
    dfbc0000-dfbc7fff : 0000:02:00.3
    dfbc8000-dfbcffff : 0000:02:00.3
    dfbd0000-dfbd7fff : 0000:02:00.2
    dfbd8000-dfbdffff : 0000:02:00.2
    dfbe0000-dfbe7fff : 0000:02:00.1
      dfbe0000-dfbe7fff : niu
    dfbe8000-dfbeffff : 0000:02:00.1
      dfbe8000-dfbeffff : niu
    dfbf0000-dfbf7fff : 0000:02:00.0
      dfbf0000-dfbf7fff : niu
    dfbf8000-dfbfffff : 0000:02:00.0
      dfbf8000-dfbfffff : niu
  dfc00000-dfcfffff : PCI Bus 0000:03
    dfc00000-dfcfffff : 0000:03:00.0
  dfd00000-dfffffff : PCI Bus 0000:04
    dfd80000-dfdfffff : 0000:04:00.0
    dfe00000-dfffffff : 0000:04:00.0
e0000000-efffffff : PCI MMCONFIG 0 [00-ff]
  e0000000-efffffff : reserved
    e0000000-efffffff : pnp 00:0b
f0000000-ffffffff : PCI Bus #00
  fec00000-fec00fff : IOAPIC 0
    fec00000-fec00fff : reserved
      fec00000-fec00fff : pnp 00:0a
  fed00000-fed003ff : HPET 0
  fee00000-feefffff : reserved
    fee00000-fee00fff : Local APIC
      fee00000-fee00fff : pnp 00:0a
    fee01000-feefffff : pnp 00:05
  ff700000-ffffffff : reserved
100000000-2047ffffff : System RAM
2048000000-fcffffffff : PCI Bus #00
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
H. Peter Anvin April 17, 2009, 10:04 p.m. UTC | #5
Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Fri, 17 Apr 2009, Ingo Molnar wrote:
>>> Could we perhaps round up to 1MB in this case too?
>> (The below 1MB one).
>>
>> I'd argue against it, at least in this incarnation. I can well 
>> imagine somebody wanting to do resource management in the 640k-1M 
>> window, so..
> 
> ok - indeed - if there's some super-small system with limited 
> address lines and all physical addresses tightly packed with RAM?
> 

No, much more likely that you're having PCI 2.x or PnP devices which 
have 20-bit resources.  It's probably worth noting that at least right 
now, Linux mishandles 20-bit BARs and treat them like 32-bit BARs.  It 
turns out to actually work on a majority of the (quite few) known 
devices which do have 20-bit BARs.

> 
> BIOS-provided physical RAM map:
>  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)       0.639 MB RAM
>  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)     0.001 MB
>                                                 [ hole ]       0.250 MB
>  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)     0.125 MB
>  BIOS-e820: 0000000000100000 - 000000003ed94000 (usable)    1004.5   MB RAM
>  BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS)     0.7   MB
>  BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable)      16.3   MB RAM
>  BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS)     0.3   MB
>  BIOS-e820: 000000003fee9000 - 000000003feed000 (usable)       0.15  MB RAM
>  BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data     0.07  MB
>  BIOS-e820: 000000003feff000 - 000000003ff00000 (usable)       0.004 MB RAM
>                                                 [ hole ]       1.0   MB
>                                                 [ hole ]    3072.0   MB
> 
> On this map, using your scheme, we'd fill up that small 1MB hole up 
> to 1GB [mockup]:
> 
>  BIOS-e820: 000000003ff00000 - 0000000040000000 (RAM buffer)
> 
> I guess that's a good thing not just for robustness: a chipset might 
> be faster when DMA or mmio is on some well-isolated physical memory 
> range, not too close to real RAM or other devices?
> 

Realistically, there probably is RAM there, probably consumed by the SMM 
T-seg.

	-hpa
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index ef2c356..e8b8d33 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1370,6 +1370,23 @@  void __init e820_reserve_resources(void)
 	}
 }
 
+/* How much should we pad RAM ending depending on where it is? */
+static unsigned long ram_alignment(resource_size_t pos)
+{
+	unsigned long mb = pos >> 20;
+
+	/* To 64kB in the first megabyte */
+	if (!mb)
+		return 64*1024;
+
+	/* To 1MB in the first 16MB */
+	if (mb < 16)
+		return 1024*1024;
+
+	/* To 32MB for anything above that */
+	return 32*1024*1024;
+}
+
 void __init e820_reserve_resources_late(void)
 {
 	int i;
@@ -1381,6 +1398,23 @@  void __init e820_reserve_resources_late(void)
 			insert_resource_expand_to_fit(&iomem_resource, res);
 		res++;
 	}
+
+	/*
+	 * Try to bump up RAM regions to reasonable boundaries to
+	 * avoid stolen RAM
+	 */
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *entry = &e820_saved.map[i];
+		resource_size_t start, end;
+
+		if (entry->type != E820_RAM)
+			continue;
+		start = entry->addr + entry->size;
+		end = round_up(start, ram_alignment(start));
+		if (start == end)
+			continue;
+		reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
+	}
 }
 
 char *__init default_machine_specific_memory_setup(void)