Message ID | alpine.LFD.2.00.0904161540590.4042@localhost.localdomain (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
* Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, 16 Apr 2009, Yinghai Lu wrote: > > > > please check. > > > > [PATCH] x86/pci: make pci_mem_start to be aligned only -v4 > > I like the approach. That said, I think that rather than do the "modify > the e820 array" thing, why not just do it in the in the resource tree, and > do it at "e820_reserve_resources_late()" time? > > IOW, something like this. > > TOTALLY UNTESTED! The point is to take all RAM resources we haev, and > _after_ we've added all the resources we've seen in the E820 tree, we then > _also_ try to add fake reserved entries for any "round up to X" at the end > of the RAM resources. > > NOTE! I really didn't want to use "reserve_region_with_split()". I didn't > want to recurse into any conflicting resources, I really wanted to just do > the other failure cases. > > THIS PATCH IS NOT MEANT TO BE USED. Just a rough "almost like this" kind > of thing. That includes the rough draft of how much to round things up to > based on where the end of RAM region is etc. I'm really throwing this out > more as a "wouldn't this be a readable way to handle any missing reserved > entries" kind of thing.. > > Linus > > --- > arch/x86/kernel/e820.c | 34 ++++++++++++++++++++++++++++++++++ > 1 files changed, 34 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > index ef2c356..e8b8d33 100644 > --- a/arch/x86/kernel/e820.c > +++ b/arch/x86/kernel/e820.c > @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void) > } > } > > +/* How much should we pad RAM ending depending on where it is? */ > +static unsigned long ram_alignment(resource_size_t pos) > +{ > + unsigned long mb = pos >> 20; > + > + /* To 64kB in the first megabyte */ > + if (!mb) > + return 64*1024; Could we perhaps round up to 1MB in this case too? This would mean that we would never dynamically allocate into the 640k..1MB hole - but even if it's free it's probably a good idea to avoid that range - it's usually quite special. So we'd just have two ganularities for round-up: 1MB and 32MB. Nice, predictable scheme. > + > + /* To 1MB in the first 16MB */ > + if (mb < 16) > + return 1024*1024; > + > + /* To 32MB for anything above that */ > + return 32*1024*1024; > +} > + > void __init e820_reserve_resources_late(void) > { > int i; > @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void) > insert_resource_expand_to_fit(&iomem_resource, res); > res++; > } > + > + /* > + * Try to bump up RAM regions to reasonable boundaries to > + * avoid stolen RAM > + */ > + for (i = 0; i < e820.nr_map; i++) { > + struct e820entry *entry = &e820_saved.map[i]; > + resource_size_t start, end; > + > + if (entry->type != E820_RAM) > + continue; Would it make sense to round up everything that is listed in the E820 map? Just in case the BIOS is not entirely honest about the true extent of that area. It might even make sense to add a small 'guard' area next to such ranges (even if they are aligned well): to prevent the BIOS from accidentally overruning into a device BAR we allocate next to it. > + start = entry->addr + entry->size; > + end = round_up(start, ram_alignment(start)); > + if (start == end) > + continue; Hm, indeed, the continue is needed - reserve_region_with_split() lets zero-sized resources be inserted silently. I'd have missed this case. Do zero-sized memory resources have a special role somewhere? [ Plus i dont see any protection against negative-size resources in kernel/resource.c either. OTOH inserting a negative size resource just locks down the tree for all future resources, so it would be noticed for sure. Zero-size is not an issue it appears, it goes in and prevents only the exactly same-position zero-size resource to be inserted. But it might sense to silently ignore zero-size resources (if we dont rely on them elsewhere), or to WARN() about them and about negative size resources. ] > + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer"); Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 17 Apr 2009, Ingo Molnar wrote: > > Could we perhaps round up to 1MB in this case too? (The below 1MB one). I'd argue against it, at least in this incarnation. I can well imagine somebody wanting to do resource management in the 640k-1M window, so.. > Would it make sense to round up everything that is listed in the > E820 map? Just in case the BIOS is not entirely honest about the > true extent of that area. Well, it would probably work, but on the other hand, when we see "E820_RAM", that means that we really _can_ trust that that E820 entry is right, since we're going to use it as RAM (and Windows would too), and if it wasn't RAM, really bad things would happen. So E820_RAM is a _lot_ more trustworthy than the other cases. If we're rounding up by reasonably large amounts like 32MB or even more, I really think we should do so for the things we really know are there, and that we really fundamentally know come in big granularities. The other entries in the e820 map can reasonably be 4kB or something, because they are an IO-APIC or whatever. I can't say that I'd feel happy putting a guard area around something like that. But RAM? Sure, it can come in 384kB chunks (think RAM remapping for the low 1MB area), but it doesn't tend to happen when we're talking gigs any more. > > + start = entry->addr + entry->size; > > + end = round_up(start, ram_alignment(start)); > > + if (start == end) > > + continue; > > Hm, indeed, the continue is needed - reserve_region_with_split() > lets zero-sized resources be inserted silently. I'd have missed this > case. Do zero-sized memory resources have a special role somewhere? No. But it wouldn't be a zero-size region, it would be a one-byte sized region. It's just that my patch was missing the "-1" from the end that I meant to put there: > > + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer"); That 'end' there should be 'end-1', and that also explains why "start == end" must have a continue. The 'end' in a resource region is the last byte, not the 'byte after'. So there was a small buglet in the patch, but as I mentioned, using "reserve_region_with_split()" is really wrong anyway, because we do not want to recurse into existing regions, just split _around_ them. So the patch was meant as Linus -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, 17 Apr 2009, Ingo Molnar wrote: > > > > Could we perhaps round up to 1MB in this case too? > > (The below 1MB one). > > I'd argue against it, at least in this incarnation. I can well > imagine somebody wanting to do resource management in the 640k-1M > window, so.. ok - indeed - if there's some super-small system with limited address lines and all physical addresses tightly packed with RAM? > > Would it make sense to round up everything that is listed in the > > E820 map? Just in case the BIOS is not entirely honest about the > > true extent of that area. > > Well, it would probably work, but on the other hand, when we see > "E820_RAM", that means that we really _can_ trust that that E820 > entry is right, since we're going to use it as RAM (and Windows > would too), and if it wasn't RAM, really bad things would happen. > > So E820_RAM is a _lot_ more trustworthy than the other cases. If > we're rounding up by reasonably large amounts like 32MB or even > more, I really think we should do so for the things we really know > are there, and that we really fundamentally know come in big > granularities. > > The other entries in the e820 map can reasonably be 4kB or > something, because they are an IO-APIC or whatever. I can't say > that I'd feel happy putting a guard area around something like > that. But RAM? Sure, it can come in 384kB chunks (think RAM > remapping for the low 1MB area), but it doesn't tend to happen > when we're talking gigs any more. One of my systems is a bit weird, with such a checkered RAM map: BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009fc00 (usable) 0.639 MB RAM BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved) 0.001 MB [ hole ] 0.250 MB BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) 0.125 MB BIOS-e820: 0000000000100000 - 000000003ed94000 (usable) 1004.5 MB RAM BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS) 0.7 MB BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable) 16.3 MB RAM BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS) 0.3 MB BIOS-e820: 000000003fee9000 - 000000003feed000 (usable) 0.15 MB RAM BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data 0.07 MB BIOS-e820: 000000003feff000 - 000000003ff00000 (usable) 0.004 MB RAM [ hole ] 1.0 MB [ hole ] 3072.0 MB On this map, using your scheme, we'd fill up that small 1MB hole up to 1GB [mockup]: BIOS-e820: 000000003ff00000 - 0000000040000000 (RAM buffer) I guess that's a good thing not just for robustness: a chipset might be faster when DMA or mmio is on some well-isolated physical memory range, not too close to real RAM or other devices? Bits of the low hole: 00000000-0009fbff : System RAM 0009fc00-0009ffff : reserved 000c0000-000dffff : pnp 00:01 000e0000-000fffff : reserved 00100000-3ed93fff : System RAM would still be available to dynamic PCI resources - as the 64K rounding would leave it alone. Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
while testing linus's patch, found some resource name get corrupted... even bare tip cat /proc/iomem 00000000-000973ff : System RAM 00097400-0009ffff : reserved 000a0000-000bffff : PCI Bus #00 000c0000-000cffff : pnp 00:0c 000e0000-000fffff : reserved 00100000-b7f9ffff : System RAM 00200000-00c67b4b : Kernel code 00c67b4c-01331edf : Kernel data 015a5000-01fc9657 : Kernel bss 20000000-23ffffff : GART b7fae000-b7faffff : System RAM b7fb0000-b7fbdfff : ACPI Tables b7fbe000-b7feffff : ACPI Non-volatile Storage b7ff0000-b7ffffff : reserved b8000000-beffffff : PCI Bus #00 bf000000-bfffffff : PCI Bus #80 bfe80000-bfebffff : pnp 00:0e bfef9000-bfef9fff : bfef9000-bfef9fff : forcedeth bfefa000-bfefa00f : bfefa000-bfefa00f : forcedeth bfefa400-bfefa4ff : bfefa400-bfefa4ff : forcedeth bfefa800-bfefa80f : bfefa800-bfefa80f : forcedeth bfefac00-bfefacff : bfefac00-bfefacff : forcedeth bfefb000-bfefbfff : bfefb000-bfefbfff : forcedeth bfefc000-bfefcfff : bfefc000-bfefcfff : sata_nv bfefd000-bfefdfff : bfefd000-bfefdfff : sata_nv bfefe000-bfefefff : bfefe000-bfefefff : sata_nv bfeff000-bfefffff : IOAPIC 1 bfeff000-bfefffff : bff00000-bfffffff : PCI Bus 0000:83 bff80000-bffbffff : 0000:83:00.0 bfffec00-bfffecff : 0000:83:00.0 bfffec00-bfffecff : lpfc bffff000-bfffffff : 0000:83:00.0 bffff000-bfffffff : lpfc c0000000-dfffffff : PCI Bus #00 c0000000-d9ffffff : PCI Bus 0000:03 c0000000-cfffffff : 0000:03:00.0 d9800000-d9ffffff : 0000:03:00.0 da6b8c00-da6b8c0f : 0000:00:09.0 da6b8c00-da6b8c0f : forcedeth da6b9000-da6b9fff : 0000:00:09.0 da6b9000-da6b9fff : forcedeth da6ba000-da6bafff : 0000:00:08.0 da6ba000-da6bafff : forcedeth da6bb000-da6bbfff : 0000:00:05.2 da6bb000-da6bbfff : sata_nv da6bc000-da6bcfff : 0000:00:05.1 da6bc000-da6bcfff : sata_nv da6bd000-da6bdfff : 0000:00:05.0 da6bd000-da6bdfff : sata_nv da6be000-da6be0ff : 0000:00:09.0 da6be000-da6be0ff : forcedeth da6be400-da6be40f : 0000:00:08.0 da6be400-da6be40f : forcedeth da6be800-da6be8ff : 0000:00:08.0 da6be800-da6be8ff : forcedeth da6bec00-da6becff : 0000:00:02.1 da6bec00-da6becff : ehci_hcd da6bf000-da6bffff : 0000:00:02.0 da6bf000-da6bffff : ohci_hcd da6c0000-da6fffff : pnp 00:05 da700000-daffffff : PCI Bus 0000:01 da7e0000-da7fffff : 0000:01:05.0 da800000-daffffff : 0000:01:05.0 db000000-dfbfffff : PCI Bus 0000:02 db000000-dbffffff : 0000:02:00.3 dc000000-dcffffff : 0000:02:00.2 dd000000-ddffffff : 0000:02:00.1 dd000000-ddffffff : niu de000000-deffffff : 0000:02:00.0 de000000-deffffff : niu df700000-df7fffff : 0000:02:00.3 df800000-df8fffff : 0000:02:00.2 df900000-df9fffff : 0000:02:00.1 dfa00000-dfafffff : 0000:02:00.0 dfbc0000-dfbc7fff : 0000:02:00.3 dfbc8000-dfbcffff : 0000:02:00.3 dfbd0000-dfbd7fff : 0000:02:00.2 dfbd8000-dfbdffff : 0000:02:00.2 dfbe0000-dfbe7fff : 0000:02:00.1 dfbe0000-dfbe7fff : niu dfbe8000-dfbeffff : 0000:02:00.1 dfbe8000-dfbeffff : niu dfbf0000-dfbf7fff : 0000:02:00.0 dfbf0000-dfbf7fff : niu dfbf8000-dfbfffff : 0000:02:00.0 dfbf8000-dfbfffff : niu dfc00000-dfcfffff : PCI Bus 0000:03 dfc00000-dfcfffff : 0000:03:00.0 dfd00000-dfffffff : PCI Bus 0000:04 dfd80000-dfdfffff : 0000:04:00.0 dfe00000-dfffffff : 0000:04:00.0 e0000000-efffffff : PCI MMCONFIG 0 [00-ff] e0000000-efffffff : reserved e0000000-efffffff : pnp 00:0b f0000000-ffffffff : PCI Bus #00 fec00000-fec00fff : IOAPIC 0 fec00000-fec00fff : reserved fec00000-fec00fff : pnp 00:0a fed00000-fed003ff : HPET 0 fee00000-feefffff : reserved fee00000-fee00fff : Local APIC fee00000-fee00fff : pnp 00:0a fee01000-feefffff : pnp 00:05 ff700000-ffffffff : reserved 100000000-2047ffffff : System RAM 2048000000-fcffffffff : PCI Bus #00 -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ingo Molnar wrote: > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Fri, 17 Apr 2009, Ingo Molnar wrote: >>> Could we perhaps round up to 1MB in this case too? >> (The below 1MB one). >> >> I'd argue against it, at least in this incarnation. I can well >> imagine somebody wanting to do resource management in the 640k-1M >> window, so.. > > ok - indeed - if there's some super-small system with limited > address lines and all physical addresses tightly packed with RAM? > No, much more likely that you're having PCI 2.x or PnP devices which have 20-bit resources. It's probably worth noting that at least right now, Linux mishandles 20-bit BARs and treat them like 32-bit BARs. It turns out to actually work on a majority of the (quite few) known devices which do have 20-bit BARs. > > BIOS-provided physical RAM map: > BIOS-e820: 0000000000000000 - 000000000009fc00 (usable) 0.639 MB RAM > BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved) 0.001 MB > [ hole ] 0.250 MB > BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) 0.125 MB > BIOS-e820: 0000000000100000 - 000000003ed94000 (usable) 1004.5 MB RAM > BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS) 0.7 MB > BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable) 16.3 MB RAM > BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS) 0.3 MB > BIOS-e820: 000000003fee9000 - 000000003feed000 (usable) 0.15 MB RAM > BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data 0.07 MB > BIOS-e820: 000000003feff000 - 000000003ff00000 (usable) 0.004 MB RAM > [ hole ] 1.0 MB > [ hole ] 3072.0 MB > > On this map, using your scheme, we'd fill up that small 1MB hole up > to 1GB [mockup]: > > BIOS-e820: 000000003ff00000 - 0000000040000000 (RAM buffer) > > I guess that's a good thing not just for robustness: a chipset might > be faster when DMA or mmio is on some well-isolated physical memory > range, not too close to real RAM or other devices? > Realistically, there probably is RAM there, probably consumed by the SMM T-seg. -hpa -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index ef2c356..e8b8d33 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -1370,6 +1370,23 @@ void __init e820_reserve_resources(void) } } +/* How much should we pad RAM ending depending on where it is? */ +static unsigned long ram_alignment(resource_size_t pos) +{ + unsigned long mb = pos >> 20; + + /* To 64kB in the first megabyte */ + if (!mb) + return 64*1024; + + /* To 1MB in the first 16MB */ + if (mb < 16) + return 1024*1024; + + /* To 32MB for anything above that */ + return 32*1024*1024; +} + void __init e820_reserve_resources_late(void) { int i; @@ -1381,6 +1398,23 @@ void __init e820_reserve_resources_late(void) insert_resource_expand_to_fit(&iomem_resource, res); res++; } + + /* + * Try to bump up RAM regions to reasonable boundaries to + * avoid stolen RAM + */ + for (i = 0; i < e820.nr_map; i++) { + struct e820entry *entry = &e820_saved.map[i]; + resource_size_t start, end; + + if (entry->type != E820_RAM) + continue; + start = entry->addr + entry->size; + end = round_up(start, ram_alignment(start)); + if (start == end) + continue; + reserve_region_with_split(&iomem_resource, start, end, "RAM buffer"); + } } char *__init default_machine_specific_memory_setup(void)