Message ID | 20180831103816.13479-5-clg@kaod.org (mailing list archive)
---|---
State | New, archived
Series | aspeed: misc fixes and enhancements (SMC)
On 31 August 2018 at 11:38, Cédric Le Goater <clg@kaod.org> wrote:
> Now that MMIO execution is supported, introduce a 'mmio-exec' property
> to boot directly from CE0 of the FMC controller using a memory region
> alias.

The name of this property seems to be a reference to QEMU's internals:
is there some other name that would make more sense to a user who cares
more about the guest and the emulated hardware?

> The overhead for the current firmware images using a custom U-Boot is
> around 2 seconds, which is fine, but with a U-Boot from mainline, it
> takes an extra 50 seconds or so to reach Linux. This might be related
> to the fact that a device tree is used.

Is this overhead just because we do a lot of execution from the MMIO
region and that's slower than execution from RAM?

> MMIO execution is not activated by default until boot time is
> improved.

Can the guest tell the difference? I'm wondering how much we might need
to care about backward compatibility if in future we fix the performance
problem, or if we can just switch the default to execute-from-flash
without breaking guests.

thanks
-- PMM
On 09/18/2018 08:44 PM, Peter Maydell wrote:
> On 31 August 2018 at 11:38, Cédric Le Goater <clg@kaod.org> wrote:
>> Now that MMIO execution is supported, introduce a 'mmio-exec' property
>> to boot directly from CE0 of the FMC controller using a memory region
>> alias.
>
> The name of this property seems to be a reference to QEMU's internals:
> is there some other name that would make more sense to a user who cares
> more about the guest and the emulated hardware?

It is true that the user does not care about internals.

"execute-from-flash" as you proposed below?

>> The overhead for the current firmware images using a custom U-Boot is
>> around 2 seconds, which is fine, but with a U-Boot from mainline, it
>> takes an extra 50 seconds or so to reach Linux. This might be related
>> to the fact that a device tree is used.
>
> Is this overhead just because we do a lot of execution from the MMIO
> region and that's slower than execution from RAM?

That is what it seems, but I haven't dug further into the problem. The
mainline U-Boot also has a rewritten DRAM calibration.

Using the logs is too painful, so I would need to add some statistics on
what is being done. Maybe the number of read/write ops on the flash
model, which we could dump from the monitor, or some more low-level
figures from QEMU internals.

>> MMIO execution is not activated by default until boot time is
>> improved.
>
> Can the guest tell the difference? I'm wondering how much we might need
> to care about backward compatibility if in future we fix the performance
> problem, or if we can just switch the default to execute-from-flash
> without breaking guests.

Yes, that is a good point. When performance is fixed, we should remove
the ROM region and the machine option. What I am proposing is more of a
workaround to analyze a problem than something we would keep in the
future.

Thanks,

C.
On 09/19/2018 08:51 AM, Cédric Le Goater wrote:
> On 09/18/2018 08:44 PM, Peter Maydell wrote:
>> On 31 August 2018 at 11:38, Cédric Le Goater <clg@kaod.org> wrote:
>>> Now that MMIO execution is supported, introduce a 'mmio-exec' property
>>> to boot directly from CE0 of the FMC controller using a memory region
>>> alias.
>>
>> The name of this property seems to be a reference to QEMU's internals:
>> is there some other name that would make more sense to a user who cares
>> more about the guest and the emulated hardware?
>
> It is true that the user does not care about internals.
>
> "execute-from-flash" as you proposed below?
>
>>> The overhead for the current firmware images using a custom U-Boot is
>>> around 2 seconds, which is fine, but with a U-Boot from mainline, it
>>> takes an extra 50 seconds or so to reach Linux. This might be related
>>> to the fact that a device tree is used.
>>
>> Is this overhead just because we do a lot of execution from the MMIO
>> region and that's slower than execution from RAM?
>
> That is what it seems, but I haven't dug further into the problem. The
> mainline U-Boot also has a rewritten DRAM calibration.
>
> Using the logs is too painful, so I would need to add some statistics on
> what is being done. Maybe the number of read/write ops on the flash
> model, which we could dump from the monitor, or some more low-level
> figures from QEMU internals.

These are the numbers of read operations done on the flash memory region:

  922478    ~ 3.5 MBytes    OpenBMC U-Boot
  20569977  ~ 80 MBytes     Mainline U-Boot

So we are thrashing the TBs, I would say. Is there a way to increase the
TB size? Or some other TB parameter? Is that directly linked to the
instruction cache size? I am no expert in that area of QEMU but willing
to experiment.

Thanks,

C.

>>> MMIO execution is not activated by default until boot time is
>>> improved.
>>
>> Can the guest tell the difference? I'm wondering how much we might need
>> to care about backward compatibility if in future we fix the performance
>> problem, or if we can just switch the default to execute-from-flash
>> without breaking guests.
>
> Yes, that is a good point. When performance is fixed, we should remove
> the ROM region and the machine option. What I am proposing is more of a
> workaround to analyze a problem than something we would keep in the
> future.
>
> Thanks,
>
> C.
On 19 September 2018 at 11:19, Cédric Le Goater <clg@kaod.org> wrote:
> These are the numbers of read operations done on the flash memory region:
>
>   922478    ~ 3.5 MBytes    OpenBMC U-Boot
>   20569977  ~ 80 MBytes     Mainline U-Boot
>
> So we are thrashing the TBs, I would say. Is there a way to increase the
> TB size? Or some other TB parameter? Is that directly linked to the
> instruction cache size?

Well, execution direct from MMIO means we read each instruction as we
execute it -- there's no caching of TBs. (We generate a temporary TB
with a single insn and throw it away after executing from it.) This is
because we can't be sure that we would get the same data value the next
time we read from the address.

So we definitely expect things to go slower -- I'm just a little
surprised that we spend so much time executing from flash devices that
aren't in "romd" mode.

thanks
-- PMM
On 09/24/2018 01:41 PM, Peter Maydell wrote:
> On 19 September 2018 at 11:19, Cédric Le Goater <clg@kaod.org> wrote:
>> These are the numbers of read operations done on the flash memory region:
>>
>>   922478    ~ 3.5 MBytes    OpenBMC U-Boot
>>   20569977  ~ 80 MBytes     Mainline U-Boot
>>
>> So we are thrashing the TBs, I would say. Is there a way to increase the
>> TB size? Or some other TB parameter? Is that directly linked to the
>> instruction cache size?
>
> Well, execution direct from MMIO means we read each instruction as we
> execute it -- there's no caching of TBs. (We generate a temporary TB
> with a single insn and throw it away after executing from it.) This is
> because we can't be sure that we would get the same data value the next
> time we read from the address.

In our case, we should have the same data. So maybe I can introduce a
read-only region with a cache to speed up the accesses. A bit like what
was done with the previous mmio interface.

> So we definitely expect things to go slower -- I'm just a little
> surprised that we spend so much time executing from flash devices
> that aren't in "romd" mode.

Yes, it is very slow, but some instructions are loaded more than 250,000
times. See below.

Thanks,

C.

 count   address
   ...       ...
149113  0x28a18
149113  0x28a1c
149113  0x28a20
149113  0x28a24
149113  0x28a28
149113  0x28a2c
149113  0x28a30
149113  0x28a34
149113  0x28a38
149113  0x28a3c
149113  0x28a40
253584  0x288b4
253584  0x288bc
253584  0x288c0
253584  0x288c8
253584  0x288cc
253584  0x288d4
253584  0x288e0
253584  0x288e4
253584  0x288f0
253584  0x288f4
253584  0x288fc
253584  0x28904
253584  0x28908
253584  0x28910
253584  0x28914
253584  0x28920
253584  0x28924
253584  0x28928
253584  0x2892c
253584  0x28934
253584  0x28938
253584  0x2893c
253584  0x28940
253584  0x28944
253584  0x2894c
253584  0x28954
253584  0x28958
253584  0x2895c
253584  0x28960
253584  0x28964
253584  0x28968
253584  0x2896c
253584  0x28970
253584  0x28974
253584  0x28978
253584  0x2897c
253584  0x28980
253584  0x53364
253585  0x288b8
253585  0x288c4
253585  0x288d0
253585  0x288d8
253585  0x288dc
253585  0x288f8
253585  0x28900
253585  0x2890c
253585  0x28918
253585  0x2891c
253585  0x28948
253585  0x53344
253586  0x28930
253586  0x28950
253587  0x288b0
257497  0x53348
277523  0x53354
On 24 September 2018 at 14:28, Cédric Le Goater <clg@kaod.org> wrote:
> On 09/24/2018 01:41 PM, Peter Maydell wrote:
>> On 19 September 2018 at 11:19, Cédric Le Goater <clg@kaod.org> wrote:
>>> These are the numbers of read operations done on the flash memory region:
>>>
>>>   922478    ~ 3.5 MBytes    OpenBMC U-Boot
>>>   20569977  ~ 80 MBytes     Mainline U-Boot
>>>
>>> So we are thrashing the TBs, I would say. Is there a way to increase the
>>> TB size? Or some other TB parameter? Is that directly linked to the
>>> instruction cache size?
>>
>> Well, execution direct from MMIO means we read each instruction as we
>> execute it -- there's no caching of TBs. (We generate a temporary TB
>> with a single insn and throw it away after executing from it.) This is
>> because we can't be sure that we would get the same data value the next
>> time we read from the address.
>
> In our case, we should have the same data. So maybe I can introduce a
> read-only region with a cache to speed up the accesses. A bit like what
> was done with the previous mmio interface.

Yes, this is in theory possible, but we dropped the old mmio-exec
interface because it had some nasty race conditions: the problem is
being able to be sure you've properly dropped the cached TBs when the
device contents change.

Which flash device is this? For pflash, typically the 'romd mode' stuff
suffices to allow execution as-if-from-RAM most of the time.

thanks
-- PMM
On 09/24/2018 03:31 PM, Peter Maydell wrote:
> On 24 September 2018 at 14:28, Cédric Le Goater <clg@kaod.org> wrote:
>> On 09/24/2018 01:41 PM, Peter Maydell wrote:
>>> On 19 September 2018 at 11:19, Cédric Le Goater <clg@kaod.org> wrote:
>>>> These are the numbers of read operations done on the flash memory region:
>>>>
>>>>   922478    ~ 3.5 MBytes    OpenBMC U-Boot
>>>>   20569977  ~ 80 MBytes     Mainline U-Boot
>>>>
>>>> So we are thrashing the TBs, I would say. Is there a way to increase the
>>>> TB size? Or some other TB parameter? Is that directly linked to the
>>>> instruction cache size?
>>>
>>> Well, execution direct from MMIO means we read each instruction as we
>>> execute it -- there's no caching of TBs. (We generate a temporary TB
>>> with a single insn and throw it away after executing from it.) This is
>>> because we can't be sure that we would get the same data value the next
>>> time we read from the address.
>>
>> In our case, we should have the same data. So maybe I can introduce a
>> read-only region with a cache to speed up the accesses. A bit like what
>> was done with the previous mmio interface.
>
> Yes, this is in theory possible, but we dropped the old mmio-exec
> interface because it had some nasty race conditions: the problem is
> being able to be sure you've properly dropped the cached TBs when the
> device contents change.
>
> Which flash device is this? For pflash, typically the 'romd mode' stuff
> suffices to allow execution as-if-from-RAM most of the time.

It's a SPI slave flash device attached to the Aspeed SMC controller. It
works very well in 'romd' mode. I am just stressing the model a bit more
with MMIO execution, but it's not something we really need.

So we can just drop this patch in the v2 patchset "aspeed: misc fixes
and enhancements (SMC)", in which I fixed a few other things.

Thanks,

C.
diff --git a/include/hw/arm/aspeed.h b/include/hw/arm/aspeed.h
index 2b77f8d2b3c8..d079f4d6e5db 100644
--- a/include/hw/arm/aspeed.h
+++ b/include/hw/arm/aspeed.h
@@ -30,6 +30,8 @@ typedef struct AspeedBoardConfig {
 
 typedef struct AspeedMachine {
     MachineState parent_obj;
+
+    bool mmio_exec;
 } AspeedMachine;
 
 #define ASPEED_MACHINE_CLASS(klass) \
diff --git a/hw/arm/aspeed.c b/hw/arm/aspeed.c
index 6b33ecd5aa43..3a66c2dedc3e 100644
--- a/hw/arm/aspeed.c
+++ b/hw/arm/aspeed.c
@@ -216,11 +216,18 @@ static void aspeed_board_init(MachineState *machine,
          * SoC and 128MB for the AST2500 SoC, which is twice as big as
          * needed by the flash modules of the Aspeed machines.
          */
-        memory_region_init_rom(boot_rom, OBJECT(bmc), "aspeed.boot_rom",
-                               fl->size, &error_abort);
-        memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
-                                    boot_rom);
-        write_boot_rom(drive0, FIRMWARE_ADDR, fl->size, &error_abort);
+        if (ASPEED_MACHINE(machine)->mmio_exec) {
+            memory_region_init_alias(boot_rom, OBJECT(bmc), "aspeed.boot_rom",
+                                     &fl->mmio, 0, fl->size);
+            memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
+                                        boot_rom);
+        } else {
+            memory_region_init_rom(boot_rom, OBJECT(bmc), "aspeed.boot_rom",
+                                   fl->size, &error_abort);
+            memory_region_add_subregion(get_system_memory(), FIRMWARE_ADDR,
+                                        boot_rom);
+            write_boot_rom(drive0, FIRMWARE_ADDR, fl->size, &error_abort);
+        }
     }
 
     aspeed_board_binfo.kernel_filename = machine->kernel_filename;
@@ -313,6 +320,29 @@ static void aspeed_machine_init(MachineState *machine)
     aspeed_board_init(machine, amc->board);
 }
 
+static bool aspeed_get_mmio_exec(Object *obj, Error **errp)
+{
+    return ASPEED_MACHINE(obj)->mmio_exec;
+}
+
+static void aspeed_set_mmio_exec(Object *obj, bool value, Error **errp)
+{
+    ASPEED_MACHINE(obj)->mmio_exec = value;
+}
+
+static void aspeed_machine_instance_init(Object *obj)
+{
+    ASPEED_MACHINE(obj)->mmio_exec = false;
+}
+
+static void aspeed_machine_class_props_init(ObjectClass *oc)
+{
+    object_class_property_add_bool(oc, "mmio-exec", aspeed_get_mmio_exec,
+                                   aspeed_set_mmio_exec, &error_abort);
+    object_class_property_set_description(oc, "mmio-exec",
+        "boot using MMIO execution", &error_abort);
+}
+
 static void aspeed_machine_class_init(ObjectClass *oc, void *data)
 {
     MachineClass *mc = MACHINE_CLASS(oc);
@@ -327,6 +357,8 @@ static void aspeed_machine_class_init(ObjectClass *oc, void *data)
     mc->no_cdrom = 1;
     mc->no_parallel = 1;
     amc->board = board;
+
+    aspeed_machine_class_props_init(oc);
 }
 
 static const TypeInfo aspeed_machine_type = {
@@ -334,6 +366,7 @@ static const TypeInfo aspeed_machine_type = {
     .parent = TYPE_MACHINE,
    .instance_size = sizeof(AspeedMachine),
     .class_size = sizeof(AspeedMachineClass),
+    .instance_init = aspeed_machine_instance_init,
     .abstract = true,
 };
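[Editorial note: with this patch applied, the new property would be enabled like any other machine option. A hypothetical invocation (the machine name, image file, and drive options are illustrative, not taken from the thread):]

```shell
# Boot an Aspeed palmetto-bmc machine executing directly from the FMC
# flash mapping instead of copying the firmware into a ROM region.
# 'flash.img' is a placeholder firmware image.
qemu-system-arm -M palmetto-bmc,mmio-exec=on \
    -drive file=flash.img,format=raw,if=mtd \
    -nographic
```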
Now that MMIO execution is supported, introduce a 'mmio-exec' property
to boot directly from CE0 of the FMC controller using a memory region
alias.

The overhead for the current firmware images using a custom U-Boot is
around 2 seconds, which is fine, but with a U-Boot from mainline, it
takes an extra 50 seconds or so to reach Linux. This might be related
to the fact that a device tree is used.

MMIO execution is not activated by default until boot time is improved.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/arm/aspeed.h |  2 ++
 hw/arm/aspeed.c         | 43 ++++++++++++++++++++++++++++++++++++-----
 2 files changed, 40 insertions(+), 5 deletions(-)