Message ID | 1475106529-17443-3-git-send-email-konrad.wilk@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
>>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote: > @@ -265,11 +266,30 @@ void pci_setup(void) > bars[i].devfn = devfn; > bars[i].bar_reg = bar_reg; > bars[i].bar_sz = bar_sz; > + bars[i].above_4gb = false; > > if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) == > PCI_BASE_ADDRESS_SPACE_MEMORY) || > (bar_reg == PCI_ROM_ADDRESS) ) > - mmio_total += bar_sz; > + { > + /* > + * If bigger than 2GB minus emulated devices BAR space and > + * APIC space, then don't try to put under 4GB. > + */ > + if ( is_64bar && (mmio_total >= GB(2) || bar_sz >= > + (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) ) As mentioned in the reply to your earlier mail already, the subtraction of mmio_total here is risking wrap through zero (the >= GB(2) check doesn't fully guard against that). Furthermore you're now making behavior dependent on the order devices appear on the bus: The same device appearing early may get its BAR placed below 4Gb whereas when it appears late, it'll get placed high. IOW I think this needs further refinement: We should in a first pass place only 32-bit BARs. In a second pass we can then see which 64-bit BARs still fit (and I think we then ought to prefer small ones). Which means we should presumably account 32- and 64-bit BARs here independent of any other considerations, deferring the decision which 64-bit ones to place low until after this first pass. Jan
On Thu, Sep 29, 2016 at 01:03:02AM -0600, Jan Beulich wrote: > >>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote: > > @@ -265,11 +266,30 @@ void pci_setup(void) > > bars[i].devfn = devfn; > > bars[i].bar_reg = bar_reg; > > bars[i].bar_sz = bar_sz; > > + bars[i].above_4gb = false; > > > > if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) == > > PCI_BASE_ADDRESS_SPACE_MEMORY) || > > (bar_reg == PCI_ROM_ADDRESS) ) > > - mmio_total += bar_sz; > > + { > > + /* > > + * If bigger than 2GB minus emulated devices BAR space and > > + * APIC space, then don't try to put under 4GB. > > + */ > > + if ( is_64bar && (mmio_total >= GB(2) || bar_sz >= > > + (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) ) > > As mentioned in the reply to your earlier mail already, the > subtraction of mmio_total here is risking wrap through zero (the > >= GB(2) check doesn't fully guard against that). I am still waking up so bear with me, but is the reason the mmio_total >= GB(2) check does not guard is because the compiler may choose to execute _both_ parts of the '||' conditional (or swap them and execute the 'mmio_total >= GB(2)' second)? [Not that I had seen it looking at the assembler output, but that maybe because I had only seen -O2 output and not -O1?] > > Furthermore you're now making behavior dependent on the order > devices appear on the bus: The same device appearing early may > get its BAR placed below 4Gb whereas when it appears late, it'll > get placed high. IOW I think this needs further refinement: We > should in a first pass place only 32-bit BARs. In a second pass we > can then see which 64-bit BARs still fit (and I think we then ought > to prefer small ones). Which means we should presumably account > 32- and 64-bit BARs here independent of any other considerations, > deferring the decision which 64-bit ones to place low until after this > first pass. Ok, that is going to require some surgery and movement of code to add some functions in that giant piece of code. Expect more patches next week (or would it be easier if I just sent them out for the next release considering the amount of patches that are floating this week that need review?) > > Jan >
>>> On 29.09.16 at 11:23, <konrad.wilk@oracle.com> wrote: > On Thu, Sep 29, 2016 at 01:03:02AM -0600, Jan Beulich wrote: >> >>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote: >> > @@ -265,11 +266,30 @@ void pci_setup(void) >> > bars[i].devfn = devfn; >> > bars[i].bar_reg = bar_reg; >> > bars[i].bar_sz = bar_sz; >> > + bars[i].above_4gb = false; >> > >> > if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) == >> > PCI_BASE_ADDRESS_SPACE_MEMORY) || >> > (bar_reg == PCI_ROM_ADDRESS) ) >> > - mmio_total += bar_sz; >> > + { >> > + /* >> > + * If bigger than 2GB minus emulated devices BAR space and >> > + * APIC space, then don't try to put under 4GB. >> > + */ >> > + if ( is_64bar && (mmio_total >= GB(2) || bar_sz >= >> > + (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) ) >> >> As mentioned in the reply to your earlier mail already, the >> subtraction of mmio_total here is risking wrap through zero (the >> >= GB(2) check doesn't fully guard against that). > > I am still waking up so bear with me, but is the reason the mmio_total >>= GB(2) check does not guard is because the compiler may choose > to execute _both_ parts of the '||' conditional (or swap them and > execute the 'mmio_total >= GB(2)' second)? No, it's because you subtract more than just mmio_total from GB(2). >> Furthermore you're now making behavior dependent on the order >> devices appear on the bus: The same device appearing early may >> get its BAR placed below 4Gb whereas when it appears late, it'll >> get placed high. IOW I think this needs further refinement: We >> should in a first pass place only 32-bit BARs. In a second pass we >> can then see which 64-bit BARs still fit (and I think we then ought >> to prefer small ones). Which means we should presumably account >> 32- and 64-bit BARs here independent of any other considerations, >> deferring the decision which 64-bit ones to place low until after this >> first pass. > > Ok, that is going to require some surgery and movement of code to add > some functions in that giant piece of code. Expect more patches next > week (or would it be easier if I just sent them out for the next release > considering the amount of patches that are floating this week that need > review?) Well, I would view this as a bug fix, so it might still be allowed in. Ask Wei if in doubt. Jan
On Thu, Sep 29, 2016 at 03:36:00AM -0600, Jan Beulich wrote: > >>> On 29.09.16 at 11:23, <konrad.wilk@oracle.com> wrote: > > On Thu, Sep 29, 2016 at 01:03:02AM -0600, Jan Beulich wrote: > >> >>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote: > >> > @@ -265,11 +266,30 @@ void pci_setup(void) > >> > bars[i].devfn = devfn; > >> > bars[i].bar_reg = bar_reg; > >> > bars[i].bar_sz = bar_sz; > >> > + bars[i].above_4gb = false; > >> > > >> > if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) == > >> > PCI_BASE_ADDRESS_SPACE_MEMORY) || > >> > (bar_reg == PCI_ROM_ADDRESS) ) > >> > - mmio_total += bar_sz; > >> > + { > >> > + /* > >> > + * If bigger than 2GB minus emulated devices BAR space and > >> > + * APIC space, then don't try to put under 4GB. > >> > + */ > >> > + if ( is_64bar && (mmio_total >= GB(2) || bar_sz >= > >> > + (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) ) > >> > >> As mentioned in the reply to your earlier mail already, the > >> subtraction of mmio_total here is risking wrap through zero (the > >> >= GB(2) check doesn't fully guard against that). > > > > I am still waking up so bear with me, but is the reason the mmio_total > >>= GB(2) check does not guard is because the compiler may choose > > to execute _both_ parts of the '||' conditional (or swap them and > > execute the 'mmio_total >= GB(2)' second)? > > No, it's because you subtract more than just mmio_total from GB(2). > > >> Furthermore you're now making behavior dependent on the order > >> devices appear on the bus: The same device appearing early may > >> get its BAR placed below 4Gb whereas when it appears late, it'll > >> get placed high. IOW I think this needs further refinement: We > >> should in a first pass place only 32-bit BARs. In a second pass we > >> can then see which 64-bit BARs still fit (and I think we then ought > >> to prefer small ones). Which means we should presumably account > >> 32- and 64-bit BARs here independent of any other considerations, > >> deferring the decision which 64-bit ones to place low until after this > >> first pass. > > > > Ok, that is going to require some surgery and movement of code to add > > some functions in that giant piece of code. Expect more patches next > > week (or would it be easier if I just sent them out for the next release > > considering the amount of patches that are floating this week that need > > review?) > > Well, I would view this as a bug fix, so it might still be allowed in. > Ask Wei if in doubt. > Before RC1, sure. After we cut RCs, anything that changes memory layout of the guests need to be considered carefully. Wei. > Jan >
On Thu, Sep 29, 2016 at 11:48:43AM +0100, Wei Liu wrote: > On Thu, Sep 29, 2016 at 03:36:00AM -0600, Jan Beulich wrote: > > >>> On 29.09.16 at 11:23, <konrad.wilk@oracle.com> wrote: > > > On Thu, Sep 29, 2016 at 01:03:02AM -0600, Jan Beulich wrote: > > >> >>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote: > > >> > @@ -265,11 +266,30 @@ void pci_setup(void) > > >> > bars[i].devfn = devfn; > > >> > bars[i].bar_reg = bar_reg; > > >> > bars[i].bar_sz = bar_sz; > > >> > + bars[i].above_4gb = false; > > >> > > > >> > if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) == > > >> > PCI_BASE_ADDRESS_SPACE_MEMORY) || > > >> > (bar_reg == PCI_ROM_ADDRESS) ) > > >> > - mmio_total += bar_sz; > > >> > + { > > >> > + /* > > >> > + * If bigger than 2GB minus emulated devices BAR space and > > >> > + * APIC space, then don't try to put under 4GB. > > >> > + */ > > >> > + if ( is_64bar && (mmio_total >= GB(2) || bar_sz >= > > >> > + (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) ) > > >> > > >> As mentioned in the reply to your earlier mail already, the > > >> subtraction of mmio_total here is risking wrap through zero (the > > >> >= GB(2) check doesn't fully guard against that). > > > > > > I am still waking up so bear with me, but is the reason the mmio_total > > >>= GB(2) check does not guard is because the compiler may choose > > > to execute _both_ parts of the '||' conditional (or swap them and > > > execute the 'mmio_total >= GB(2)' second)? > > > > No, it's because you subtract more than just mmio_total from GB(2). > > > > >> Furthermore you're now making behavior dependent on the order > > >> devices appear on the bus: The same device appearing early may > > >> get its BAR placed below 4Gb whereas when it appears late, it'll > > >> get placed high. IOW I think this needs further refinement: We > > >> should in a first pass place only 32-bit BARs. In a second pass we > > >> can then see which 64-bit BARs still fit (and I think we then ought > > >> to prefer small ones). Which means we should presumably account > > >> 32- and 64-bit BARs here independent of any other considerations, > > >> deferring the decision which 64-bit ones to place low until after this > > >> first pass. > > > > > > Ok, that is going to require some surgery and movement of code to add > > > some functions in that giant piece of code. Expect more patches next > > > week (or would it be easier if I just sent them out for the next release > > > considering the amount of patches that are floating this week that need > > > review?) > > > > Well, I would view this as a bug fix, so it might still be allowed in. > > Ask Wei if in doubt. > > > > Before RC1, sure. After we cut RCs, anything that changes memory layout > of the guests need to be considered carefully. OK. I will try my best to get it done before then. But if I fail we can always look at this for the next release. (Along with other patches that I need to redo).
diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c index 416829d..f6194b9 100644 --- a/tools/firmware/hvmloader/pci.c +++ b/tools/firmware/hvmloader/pci.c @@ -80,7 +80,7 @@ void pci_setup(void) { uint8_t is_64bar, using_64bar, bar64_relocate = 0; uint32_t devfn, bar_reg, cmd, bar_data, bar_data_upper; - uint64_t base, bar_sz, bar_sz_upper, mmio_total = 0; + uint64_t base, bar_sz, bar_sz_upper, mmio_total = 0, mmio_64bit_total = 0; uint32_t vga_devfn = 256; uint16_t class, vendor_id, device_id; unsigned int bar, pin, link, isa_irq; @@ -97,6 +97,7 @@ void pci_setup(void) uint32_t devfn; uint32_t bar_reg; uint64_t bar_sz; + bool above_4gb; } *bars = (struct bars *)scratch_start; unsigned int i, nr_bars = 0; uint64_t mmio_hole_size = 0; @@ -265,11 +266,30 @@ void pci_setup(void) bars[i].devfn = devfn; bars[i].bar_reg = bar_reg; bars[i].bar_sz = bar_sz; + bars[i].above_4gb = false; if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY) || (bar_reg == PCI_ROM_ADDRESS) ) - mmio_total += bar_sz; + { + /* + * If bigger than 2GB minus emulated devices BAR space and + * APIC space, then don't try to put under 4GB. + */ + if ( is_64bar && (mmio_total >= GB(2) || bar_sz >= + (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) ) + { + mmio_64bit_total += bar_sz; + bars[i].above_4gb = true; + /* + * As this may not trigger now that mmio_total could be + * less than 2GB, so force it. + */ + bar64_relocate = 1; + } + else + mmio_total += bar_sz; + } nr_bars++; @@ -349,7 +369,7 @@ void pci_setup(void) pci_mem_start = hvm_info->low_mem_pgend << PAGE_SHIFT; } - if ( mmio_total > (pci_mem_end - pci_mem_start) ) + if ( mmio_total > (pci_mem_end - pci_mem_start) || bar64_relocate ) { printf("Low MMIO hole not large enough for all devices," " relocating some BARs to 64-bit\n"); @@ -431,7 +451,7 @@ void pci_setup(void) * Should either of those two conditions change, this code will break. */ using_64bar = bars[i].is_64bar && bar64_relocate - && (mmio_total > (mem_resource.max - mem_resource.base)); + && ((mmio_total + mmio_64bit_total) > (mem_resource.max - mem_resource.base)); bar_data = pci_readl(devfn, bar_reg); if ( (bar_data & PCI_BASE_ADDRESS_SPACE) == @@ -451,7 +471,10 @@ void pci_setup(void) resource = &mem_resource; bar_data &= ~PCI_BASE_ADDRESS_MEM_MASK; } - mmio_total -= bar_sz; + if ( bars[i].above_4gb ) + mmio_64bit_total -= bar_sz; + else + mmio_total -= bar_sz; } else {
Where ~2GB is actually 2GB minus MMIO space used for emulated devices and minus APIC space. There is no point. We can try to balloon out the memory between hvm_info->low_mem_pgend to pci_mem_end and we will never be able have a hole big enough for 2GB MMIO. As we can't go lower than 0x80000000 and can't go above 0xF0000000 which effectively leaves us with 1792MB of MMIO space (-32MB for VGA and -16 for platform device, so in reality: 1744MB). Instead just let it go above 4GB in the 64-bit zone. Note that prior to this patch the hvmloader would relocate as much memory as it could under 4GB: Low MMIO hole not large enough for all devices, relocating some BARs to 64-bit Relocating 0xffff pages from 0e0001000 to 210000000 for lowmem MMIO hole Relocating 0xffff pages from 0d0002000 to 21ffff000 for lowmem MMIO hole Relocating 0xffff pages from 0c0003000 to 22fffe000 for lowmem MMIO hole Relocating 0xffff pages from 0b0004000 to 23fffd000 for lowmem MMIO hole Relocating 0xffff pages from 0a0005000 to 24fffc000 for lowmem MMIO hole Relocating 0xffff pages from 090006000 to 25fffb000 for lowmem MMIO hole Relocating 0xffff pages from 080007000 to 26fffa000 for lowmem MMIO hole Relocating 0x7 pages from 080000000 to 27fff9000 for lowmem MMIO hole which is completely pointless. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> v1: First submission. v2: Check for up to 1744MB. - Use defines for 4GB, 2GB. - Fix style guide issues. --- tools/firmware/hvmloader/pci.c | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-)