diff mbox

[v2,2/2] hvmloader, pci: Don't try to relocate memory if 64-bit BAR is bigger than ~2GB

Message ID 1475106529-17443-3-git-send-email-konrad.wilk@oracle.com (mailing list archive)
State New, archived
Headers show

Commit Message

Konrad Rzeszutek Wilk Sept. 28, 2016, 11:48 p.m. UTC
Where ~2GB is actually 2GB minus MMIO space used for emulated devices
and minus APIC space.

There is no point. We can try to balloon out the memory between
hvm_info->low_mem_pgend to pci_mem_end and we will never be able
have a hole big enough for 2GB MMIO. As we can't go lower than
0x80000000 and can't go above 0xF0000000 which effectively leaves
us with 1792MB of MMIO space (-32MB for VGA and -16 for platform
device, so in reality: 1744MB).

Instead just let it go above 4GB in the 64-bit zone.

Note that prior to this patch the hvmloader would relocate as much
memory as it could under 4GB:

Low MMIO hole not large enough for all devices, relocating some BARs to 64-bit
Relocating 0xffff pages from 0e0001000 to 210000000 for lowmem MMIO hole
Relocating 0xffff pages from 0d0002000 to 21ffff000 for lowmem MMIO hole
Relocating 0xffff pages from 0c0003000 to 22fffe000 for lowmem MMIO hole
Relocating 0xffff pages from 0b0004000 to 23fffd000 for lowmem MMIO hole
Relocating 0xffff pages from 0a0005000 to 24fffc000 for lowmem MMIO hole
Relocating 0xffff pages from 090006000 to 25fffb000 for lowmem MMIO hole
Relocating 0xffff pages from 080007000 to 26fffa000 for lowmem MMIO hole
Relocating 0x7 pages from 080000000 to 27fff9000 for lowmem MMIO hole

which is completely pointless.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>

v1: First submission.

v2: Check for up to 1744MB.
  - Use defines for 4GB, 2GB.
  - Fix style guide issues.
---
 tools/firmware/hvmloader/pci.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

Comments

Jan Beulich Sept. 29, 2016, 7:03 a.m. UTC | #1
>>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote:
> @@ -265,11 +266,30 @@ void pci_setup(void)
>              bars[i].devfn   = devfn;
>              bars[i].bar_reg = bar_reg;
>              bars[i].bar_sz  = bar_sz;
> +            bars[i].above_4gb = false;
>  
>              if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
>                    PCI_BASE_ADDRESS_SPACE_MEMORY) ||
>                   (bar_reg == PCI_ROM_ADDRESS) )
> -                mmio_total += bar_sz;
> +            {
> +                /*
> +                 * If bigger than 2GB minus emulated devices BAR space and
> +                 * APIC space, then don't try to put under 4GB.
> +                 */
> +                if ( is_64bar && (mmio_total >= GB(2) || bar_sz >=
> +                     (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) )

As mentioned in the reply to your earlier mail already, the
subtraction of mmio_total here is risking wrap through zero (the
>= GB(2) check doesn't fully guard against that).

Furthermore you're now making behavior dependent on the order
devices appear on the bus: The same device appearing early may
get its BAR placed below 4Gb whereas when it appears late, it'll
get placed high. IOW I think this needs further refinement: We
should in a first pass place only 32-bit BARs. In a second pass we
can then see which 64-bit BARs still fit (and I think we then ought
to prefer small ones). Which means we should presumably account
32- and 64-bit BARs here independent of any other considerations,
deferring the decision which 64-bit ones to place low until after this
first pass.

Jan
Konrad Rzeszutek Wilk Sept. 29, 2016, 9:23 a.m. UTC | #2
On Thu, Sep 29, 2016 at 01:03:02AM -0600, Jan Beulich wrote:
> >>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote:
> > @@ -265,11 +266,30 @@ void pci_setup(void)
> >              bars[i].devfn   = devfn;
> >              bars[i].bar_reg = bar_reg;
> >              bars[i].bar_sz  = bar_sz;
> > +            bars[i].above_4gb = false;
> >  
> >              if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
> >                    PCI_BASE_ADDRESS_SPACE_MEMORY) ||
> >                   (bar_reg == PCI_ROM_ADDRESS) )
> > -                mmio_total += bar_sz;
> > +            {
> > +                /*
> > +                 * If bigger than 2GB minus emulated devices BAR space and
> > +                 * APIC space, then don't try to put under 4GB.
> > +                 */
> > +                if ( is_64bar && (mmio_total >= GB(2) || bar_sz >=
> > +                     (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) )
> 
> As mentioned in the reply to your earlier mail already, the
> subtraction of mmio_total here is risking wrap through zero (the
> >= GB(2) check doesn't fully guard against that).

I am still waking up so bear with me, but is the reason the mmio_total
>= GB(2) check does not guard is because the compiler may choose
to execute _both_ parts of the '||' conditional (or swap them and
execute the 'mmio_total >= GB(2)' second)?

[Not that I had seen it looking at the assembler output, but that maybe
because I had only seen -O2 output and not -O1?]
> 
> Furthermore you're now making behavior dependent on the order
> devices appear on the bus: The same device appearing early may
> get its BAR placed below 4Gb whereas when it appears late, it'll
> get placed high. IOW I think this needs further refinement: We
> should in a first pass place only 32-bit BARs. In a second pass we
> can then see which 64-bit BARs still fit (and I think we then ought
> to prefer small ones). Which means we should presumably account
> 32- and 64-bit BARs here independent of any other considerations,
> deferring the decision which 64-bit ones to place low until after this
> first pass.

Ok, that is going to require some surgery and movement of code to add
some functions in that giant piece of code. Expect more patches next
week (or would it be easier if I just sent them out for the next release
considering the amount of patches that are floating this week that need
review?)

> 
> Jan
>
Jan Beulich Sept. 29, 2016, 9:36 a.m. UTC | #3
>>> On 29.09.16 at 11:23, <konrad.wilk@oracle.com> wrote:
> On Thu, Sep 29, 2016 at 01:03:02AM -0600, Jan Beulich wrote:
>> >>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote:
>> > @@ -265,11 +266,30 @@ void pci_setup(void)
>> >              bars[i].devfn   = devfn;
>> >              bars[i].bar_reg = bar_reg;
>> >              bars[i].bar_sz  = bar_sz;
>> > +            bars[i].above_4gb = false;
>> >  
>> >              if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
>> >                    PCI_BASE_ADDRESS_SPACE_MEMORY) ||
>> >                   (bar_reg == PCI_ROM_ADDRESS) )
>> > -                mmio_total += bar_sz;
>> > +            {
>> > +                /*
>> > +                 * If bigger than 2GB minus emulated devices BAR space and
>> > +                 * APIC space, then don't try to put under 4GB.
>> > +                 */
>> > +                if ( is_64bar && (mmio_total >= GB(2) || bar_sz >=
>> > +                     (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) )
>> 
>> As mentioned in the reply to your earlier mail already, the
>> subtraction of mmio_total here is risking wrap through zero (the
>> >= GB(2) check doesn't fully guard against that).
> 
> I am still waking up so bear with me, but is the reason the mmio_total
>>= GB(2) check does not guard is because the compiler may choose
> to execute _both_ parts of the '||' conditional (or swap them and
> execute the 'mmio_total >= GB(2)' second)?

No, it's because you subtract more than just mmio_total from GB(2).

>> Furthermore you're now making behavior dependent on the order
>> devices appear on the bus: The same device appearing early may
>> get its BAR placed below 4Gb whereas when it appears late, it'll
>> get placed high. IOW I think this needs further refinement: We
>> should in a first pass place only 32-bit BARs. In a second pass we
>> can then see which 64-bit BARs still fit (and I think we then ought
>> to prefer small ones). Which means we should presumably account
>> 32- and 64-bit BARs here independent of any other considerations,
>> deferring the decision which 64-bit ones to place low until after this
>> first pass.
> 
> Ok, that is going to require some surgery and movement of code to add
> some functions in that giant piece of code. Expect more patches next
> week (or would it be easier if I just sent them out for the next release
> considering the amount of patches that are floating this week that need
> review?)

Well, I would view this as a bug fix, so it might still be allowed in.
Ask Wei if in doubt.

Jan
Wei Liu Sept. 29, 2016, 10:48 a.m. UTC | #4
On Thu, Sep 29, 2016 at 03:36:00AM -0600, Jan Beulich wrote:
> >>> On 29.09.16 at 11:23, <konrad.wilk@oracle.com> wrote:
> > On Thu, Sep 29, 2016 at 01:03:02AM -0600, Jan Beulich wrote:
> >> >>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote:
> >> > @@ -265,11 +266,30 @@ void pci_setup(void)
> >> >              bars[i].devfn   = devfn;
> >> >              bars[i].bar_reg = bar_reg;
> >> >              bars[i].bar_sz  = bar_sz;
> >> > +            bars[i].above_4gb = false;
> >> >  
> >> >              if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
> >> >                    PCI_BASE_ADDRESS_SPACE_MEMORY) ||
> >> >                   (bar_reg == PCI_ROM_ADDRESS) )
> >> > -                mmio_total += bar_sz;
> >> > +            {
> >> > +                /*
> >> > +                 * If bigger than 2GB minus emulated devices BAR space and
> >> > +                 * APIC space, then don't try to put under 4GB.
> >> > +                 */
> >> > +                if ( is_64bar && (mmio_total >= GB(2) || bar_sz >=
> >> > +                     (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) )
> >> 
> >> As mentioned in the reply to your earlier mail already, the
> >> subtraction of mmio_total here is risking wrap through zero (the
> >> >= GB(2) check doesn't fully guard against that).
> > 
> > I am still waking up so bear with me, but is the reason the mmio_total
> >>= GB(2) check does not guard is because the compiler may choose
> > to execute _both_ parts of the '||' conditional (or swap them and
> > execute the 'mmio_total >= GB(2)' second)?
> 
> No, it's because you subtract more than just mmio_total from GB(2).
> 
> >> Furthermore you're now making behavior dependent on the order
> >> devices appear on the bus: The same device appearing early may
> >> get its BAR placed below 4Gb whereas when it appears late, it'll
> >> get placed high. IOW I think this needs further refinement: We
> >> should in a first pass place only 32-bit BARs. In a second pass we
> >> can then see which 64-bit BARs still fit (and I think we then ought
> >> to prefer small ones). Which means we should presumably account
> >> 32- and 64-bit BARs here independent of any other considerations,
> >> deferring the decision which 64-bit ones to place low until after this
> >> first pass.
> > 
> > Ok, that is going to require some surgery and movement of code to add
> > some functions in that giant piece of code. Expect more patches next
> > week (or would it be easier if I just sent them out for the next release
> > considering the amount of patches that are floating this week that need
> > review?)
> 
> Well, I would view this as a bug fix, so it might still be allowed in.
> Ask Wei if in doubt.
> 

Before RC1, sure. After we cut RCs, anything that changes memory layout
of the guests need to be considered carefully.

Wei.

> Jan
>
Konrad Rzeszutek Wilk Sept. 30, 2016, 2:55 p.m. UTC | #5
On Thu, Sep 29, 2016 at 11:48:43AM +0100, Wei Liu wrote:
> On Thu, Sep 29, 2016 at 03:36:00AM -0600, Jan Beulich wrote:
> > >>> On 29.09.16 at 11:23, <konrad.wilk@oracle.com> wrote:
> > > On Thu, Sep 29, 2016 at 01:03:02AM -0600, Jan Beulich wrote:
> > >> >>> On 29.09.16 at 01:48, <konrad.wilk@oracle.com> wrote:
> > >> > @@ -265,11 +266,30 @@ void pci_setup(void)
> > >> >              bars[i].devfn   = devfn;
> > >> >              bars[i].bar_reg = bar_reg;
> > >> >              bars[i].bar_sz  = bar_sz;
> > >> > +            bars[i].above_4gb = false;
> > >> >  
> > >> >              if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
> > >> >                    PCI_BASE_ADDRESS_SPACE_MEMORY) ||
> > >> >                   (bar_reg == PCI_ROM_ADDRESS) )
> > >> > -                mmio_total += bar_sz;
> > >> > +            {
> > >> > +                /*
> > >> > +                 * If bigger than 2GB minus emulated devices BAR space and
> > >> > +                 * APIC space, then don't try to put under 4GB.
> > >> > +                 */
> > >> > +                if ( is_64bar && (mmio_total >= GB(2) || bar_sz >=
> > >> > +                     (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) )
> > >> 
> > >> As mentioned in the reply to your earlier mail already, the
> > >> subtraction of mmio_total here is risking wrap through zero (the
> > >> >= GB(2) check doesn't fully guard against that).
> > > 
> > > I am still waking up so bear with me, but is the reason the mmio_total
> > >>= GB(2) check does not guard is because the compiler may choose
> > > to execute _both_ parts of the '||' conditional (or swap them and
> > > execute the 'mmio_total >= GB(2)' second)?
> > 
> > No, it's because you subtract more than just mmio_total from GB(2).
> > 
> > >> Furthermore you're now making behavior dependent on the order
> > >> devices appear on the bus: The same device appearing early may
> > >> get its BAR placed below 4Gb whereas when it appears late, it'll
> > >> get placed high. IOW I think this needs further refinement: We
> > >> should in a first pass place only 32-bit BARs. In a second pass we
> > >> can then see which 64-bit BARs still fit (and I think we then ought
> > >> to prefer small ones). Which means we should presumably account
> > >> 32- and 64-bit BARs here independent of any other considerations,
> > >> deferring the decision which 64-bit ones to place low until after this
> > >> first pass.
> > > 
> > > Ok, that is going to require some surgery and movement of code to add
> > > some functions in that giant piece of code. Expect more patches next
> > > week (or would it be easier if I just sent them out for the next release
> > > considering the amount of patches that are floating this week that need
> > > review?)
> > 
> > Well, I would view this as a bug fix, so it might still be allowed in.
> > Ask Wei if in doubt.
> > 
> 
> Before RC1, sure. After we cut RCs, anything that changes memory layout
> of the guests need to be considered carefully.

OK. I will try my best to get it done before then. But if I fail we can
always look at this for the next release. (Along with other patches
that I need to redo).
diff mbox

Patch

diff --git a/tools/firmware/hvmloader/pci.c b/tools/firmware/hvmloader/pci.c
index 416829d..f6194b9 100644
--- a/tools/firmware/hvmloader/pci.c
+++ b/tools/firmware/hvmloader/pci.c
@@ -80,7 +80,7 @@  void pci_setup(void)
 {
     uint8_t is_64bar, using_64bar, bar64_relocate = 0;
     uint32_t devfn, bar_reg, cmd, bar_data, bar_data_upper;
-    uint64_t base, bar_sz, bar_sz_upper, mmio_total = 0;
+    uint64_t base, bar_sz, bar_sz_upper, mmio_total = 0, mmio_64bit_total = 0;
     uint32_t vga_devfn = 256;
     uint16_t class, vendor_id, device_id;
     unsigned int bar, pin, link, isa_irq;
@@ -97,6 +97,7 @@  void pci_setup(void)
         uint32_t devfn;
         uint32_t bar_reg;
         uint64_t bar_sz;
+        bool above_4gb;
     } *bars = (struct bars *)scratch_start;
     unsigned int i, nr_bars = 0;
     uint64_t mmio_hole_size = 0;
@@ -265,11 +266,30 @@  void pci_setup(void)
             bars[i].devfn   = devfn;
             bars[i].bar_reg = bar_reg;
             bars[i].bar_sz  = bar_sz;
+            bars[i].above_4gb = false;
 
             if ( ((bar_data & PCI_BASE_ADDRESS_SPACE) ==
                   PCI_BASE_ADDRESS_SPACE_MEMORY) ||
                  (bar_reg == PCI_ROM_ADDRESS) )
-                mmio_total += bar_sz;
+            {
+                /*
+                 * If bigger than 2GB minus emulated devices BAR space and
+                 * APIC space, then don't try to put under 4GB.
+                 */
+                if ( is_64bar && (mmio_total >= GB(2) || bar_sz >=
+                     (GB(2) - HVM_BELOW_4G_MMIO_LENGTH - mmio_total)) )
+                {
+                    mmio_64bit_total += bar_sz;
+                    bars[i].above_4gb = true;
+                    /*
+                     * As this may not trigger now that mmio_total could be
+                     * less than 2GB, so force it.
+                     */
+                    bar64_relocate = 1;
+                }
+                else
+                    mmio_total += bar_sz;
+            }
 
             nr_bars++;
 
@@ -349,7 +369,7 @@  void pci_setup(void)
             pci_mem_start = hvm_info->low_mem_pgend << PAGE_SHIFT;
     }
 
-    if ( mmio_total > (pci_mem_end - pci_mem_start) )
+    if ( mmio_total > (pci_mem_end - pci_mem_start) || bar64_relocate )
     {
         printf("Low MMIO hole not large enough for all devices,"
                " relocating some BARs to 64-bit\n");
@@ -431,7 +451,7 @@  void pci_setup(void)
          * Should either of those two conditions change, this code will break.
          */
         using_64bar = bars[i].is_64bar && bar64_relocate
-            && (mmio_total > (mem_resource.max - mem_resource.base));
+            && ((mmio_total + mmio_64bit_total) > (mem_resource.max - mem_resource.base));
         bar_data = pci_readl(devfn, bar_reg);
 
         if ( (bar_data & PCI_BASE_ADDRESS_SPACE) ==
@@ -451,7 +471,10 @@  void pci_setup(void)
                 resource = &mem_resource;
                 bar_data &= ~PCI_BASE_ADDRESS_MEM_MASK;
             }
-            mmio_total -= bar_sz;
+            if ( bars[i].above_4gb )
+                mmio_64bit_total -= bar_sz;
+            else
+                mmio_total -= bar_sz;
         }
         else
         {