Message ID | 20250211111853.2199764-1-grygorii_strashko@epam.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | device-tree: optimize dt_device_for_passthrough() | expand |
On 11/02/2025 12:18, Grygorii Strashko wrote: > > > The dt_device_for_passthrough() is called many times during Xen > initialization and Dom0 creation. On every call it traverses struct > dt_device_node properties list and compares compares properties name with double "compares" > "xen,passthrough" which is runtime overhead. This can be optimized by Are you sure? Looking at the calls, it's almost only used at boot except for dt overlay. > marking dt_device_node as passthrough while unflattening DT. > > This patch introduced new struct dt_device_node property "is_passthrough" > which is filled if "xen,passthrough" property is present while unflattening > DT and dt_device_for_passthrough() just return it's value. In the past we were skeptical about adding new fields to the dt_device_node structure for use cases like this one. I would say this optimization is not worth it. Also, why would you optimize dt_device_for_passthrough but not e.g. dt_device_is_available. You can check with other Arm maintainers. ~Michal > > Signed-off-by: Grygorii Strashko <grygorii_strashko@epam.com> > --- > xen/common/device-tree/device-tree.c | 7 +++++-- > xen/include/xen/device_tree.h | 2 ++ > 2 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/xen/common/device-tree/device-tree.c b/xen/common/device-tree/device-tree.c > index d0528c582565..a329aaf576da 100644 > --- a/xen/common/device-tree/device-tree.c > +++ b/xen/common/device-tree/device-tree.c > @@ -1682,8 +1682,7 @@ bool dt_device_is_available(const struct dt_device_node *device) > > bool dt_device_for_passthrough(const struct dt_device_node *device) > { > - return (dt_find_property(device, "xen,passthrough", NULL) != NULL); > - > + return device->is_passthrough; > } > > static int __dt_parse_phandle_with_args(const struct dt_device_node *np, > @@ -1913,6 +1912,7 @@ static unsigned long unflatten_dt_node(const void *fdt, > np->used_by = 0; > /* By default the device is not protected */ > np->is_protected = false; > + np->is_passthrough = false; > INIT_LIST_HEAD(&np->domain_list); > > if ( new_format ) > @@ -2001,6 +2001,9 @@ static unsigned long unflatten_dt_node(const void *fdt, > * stuff */ > if ( strcmp(pname, "ibm,phandle") == 0 ) > np->phandle = be32_to_cpup((__be32 *)*p); > + > + if ( strcmp(pname, "xen,passthrough") == 0 ) > + np->is_passthrough = true; > pp->name = pname; > pp->length = sz; > pp->value = (void *)*p; > diff --git a/xen/include/xen/device_tree.h b/xen/include/xen/device_tree.h > index 5ff763bb80bb..96001d5b7843 100644 > --- a/xen/include/xen/device_tree.h > +++ b/xen/include/xen/device_tree.h > @@ -94,6 +94,8 @@ struct dt_device_node { > > /* IOMMU specific fields */ > bool is_protected; > + /* Indicates DT device is for passthrough */ > + bool is_passthrough; > > #ifdef CONFIG_STATIC_EVTCHN > /* HACK: Remove this if there is a need of space */ > -- > 2.34.1 >
On 11/02/2025 11:57, Orzel, Michal wrote: > On 11/02/2025 12:18, Grygorii Strashko wrote: >> >> >> The dt_device_for_passthrough() is called many times during Xen >> initialization and Dom0 creation. On every call it traverses struct >> dt_device_node properties list and compares compares properties name with > double "compares" > >> "xen,passthrough" which is runtime overhead. This can be optimized by > Are you sure? Looking at the calls, it's almost only used at boot except for dt > overlay. > >> marking dt_device_node as passthrough while unflattening DT. >> >> This patch introduced new struct dt_device_node property "is_passthrough" >> which is filled if "xen,passthrough" property is present while unflattening >> DT and dt_device_for_passthrough() just return it's value. > In the past we were skeptical about adding new fields to the dt_device_node > structure for use cases like this one. I would say this optimization is not > worth it. Also, why would you optimize dt_device_for_passthrough but not > e.g. dt_device_is_available. So we are trading speed with memory usage. It looks like we may be using a padding, although I didn't check whether the existing structure could be packed... > > You can check with other Arm maintainers. Before forging an opinion, I'd like to see some numbers showing the performance improvement. Also, is this impacting only boot? Also, I agree with Michal, if this is a concern for dt_device_device_for_passthrough(). Then it would be a concern for a few others calls using dt_find_property(). Are you going to modify all of them? Cheers,
On 11.02.25 14:32, Julien Grall wrote: > > > On 11/02/2025 11:57, Orzel, Michal wrote: >> On 11/02/2025 12:18, Grygorii Strashko wrote: >>> >>> >>> The dt_device_for_passthrough() is called many times during Xen >>> initialization and Dom0 creation. On every call it traverses struct >>> dt_device_node properties list and compares compares properties name with >> double "compares" 10x >> >>> "xen,passthrough" which is runtime overhead. This can be optimized by >> Are you sure? Looking at the calls, it's almost only used at boot except for dt >> overlay. >> >>> marking dt_device_node as passthrough while unflattening DT. >>> >>> This patch introduced new struct dt_device_node property "is_passthrough" >>> which is filled if "xen,passthrough" property is present while unflattening >>> DT and dt_device_for_passthrough() just return it's value. >> In the past we were skeptical about adding new fields to the dt_device_node >> structure for use cases like this one. I would say this optimization is not >> worth it. Also, why would you optimize dt_device_for_passthrough but not >> e.g. dt_device_is_available. > > So we are trading speed with memory usage. It looks like we may be using a padding, although I didn't check whether the existing structure could be packed... Actually I see it consumes nothing due to alignments: - before sizeof(dt_device_node)=144 - after sizeof(dt_device_node)=144 But it could be made bool is_passthrough:1; together with other bools, and probably moved at the end of struct dt_device_node. By the way, below diff can save 8 bytes on arm64 per struct dt_device_node. > >> >> You can check with other Arm maintainers. > > Before forging an opinion, I'd like to see some numbers showing the performance improvement. Also, is this impacting only boot? Sry, indeed only boot, need to be more specific. My DT: ~700 nodes Number of calls till the end of create_dom0(): (XEN) =============== dt_device_for_passthrough=2684 dt_device_is_available=1429. I've enabled console_timestamps=boot and did 5 boots and calculated average - it gives ~20 microseconds improvement. > > Also, I agree with Michal, if this is a concern for dt_device_device_for_passthrough(). Then it would be a concern for a few others calls using dt_find_property(). Are you going to modify all of them? I follow the rule one patch one functional change. Regarding further optimization - I think only generic properties can be optimized and personally I see now only "xen,passthrough" and may be "status". Ok. 20 microseconds. it's probably more like a measurement error margin. Please advice if i should continue or drop? Thank you for you comments. -grygorii --- diff --git a/xen/include/xen/device_tree.h b/xen/include/xen/device_tree.h index 96001d5b7843..0448cc297445 100644 --- a/xen/include/xen/device_tree.h +++ b/xen/include/xen/device_tree.h @@ -81,8 +81,8 @@ struct dt_property { struct dt_device_node { const char *name; const char *type; - dt_phandle phandle; char *full_name; + dt_phandle phandle; domid_t used_by; /* By default it's used by dom0 */ struct dt_property *properties;
On 11/02/2025 16:14, Grygorii Strashko wrote: > > > On 11.02.25 14:32, Julien Grall wrote: >> >> >> On 11/02/2025 11:57, Orzel, Michal wrote: >>> On 11/02/2025 12:18, Grygorii Strashko wrote: >>>> >>>> >>>> The dt_device_for_passthrough() is called many times during Xen >>>> initialization and Dom0 creation. On every call it traverses struct >>>> dt_device_node properties list and compares compares properties name with >>> double "compares" > > 10x > >>> >>>> "xen,passthrough" which is runtime overhead. This can be optimized by >>> Are you sure? Looking at the calls, it's almost only used at boot except for dt >>> overlay. >>> >>>> marking dt_device_node as passthrough while unflattening DT. >>>> >>>> This patch introduced new struct dt_device_node property "is_passthrough" >>>> which is filled if "xen,passthrough" property is present while unflattening >>>> DT and dt_device_for_passthrough() just return it's value. >>> In the past we were skeptical about adding new fields to the dt_device_node >>> structure for use cases like this one. I would say this optimization is not >>> worth it. Also, why would you optimize dt_device_for_passthrough but not >>> e.g. dt_device_is_available. >> >> So we are trading speed with memory usage. It looks like we may be using a padding, although I didn't check whether the existing structure could be packed... > > Actually I see it consumes nothing due to alignments: > - before sizeof(dt_device_node)=144 > - after sizeof(dt_device_node)=144 > > But it could be made bool is_passthrough:1; together with other bools, and probably moved at the end of struct dt_device_node. > > By the way, below diff can save 8 bytes on arm64 per struct dt_device_node. See below. > >> >>> >>> You can check with other Arm maintainers. >> >> Before forging an opinion, I'd like to see some numbers showing the performance improvement. Also, is this impacting only boot? > > Sry, indeed only boot, need to be more specific. > > My DT: ~700 nodes > Number of calls till the end of create_dom0(): > (XEN) =============== dt_device_for_passthrough=2684 dt_device_is_available=1429. > > I've enabled console_timestamps=boot and did 5 boots and calculated average - it gives ~20 microseconds improvement. > > >> >> Also, I agree with Michal, if this is a concern for dt_device_device_for_passthrough(). Then it would be a concern for a few others calls using dt_find_property(). Are you going to modify all of them? > > I follow the rule one patch one functional change. Regarding further optimization - I think only generic properties can be optimized and personally I see now only "xen,passthrough" and may be "status". > > Ok. 20 microseconds. it's probably more like a measurement error margin. > Please advice if i should continue or drop? I'd suggest to drop it. If we want to optimize the struct, then it does not make sense to do this only for one architecture. If at all, I would do: diff --git a/xen/include/xen/device_tree.h b/xen/include/xen/device_tree.h index 5ff763bb80bb..0ff80fda04da 100644 --- a/xen/include/xen/device_tree.h +++ b/xen/include/xen/device_tree.h @@ -81,17 +81,10 @@ struct dt_property { struct dt_device_node { const char *name; const char *type; - dt_phandle phandle; char *full_name; + dt_phandle phandle; domid_t used_by; /* By default it's used by dom0 */ - struct dt_property *properties; - struct dt_device_node *parent; - struct dt_device_node *child; - struct dt_device_node *sibling; - struct dt_device_node *next; /* TODO: Remove it. Only use to know the last children */ - struct dt_device_node *allnext; - /* IOMMU specific fields */ bool is_protected; @@ -100,6 +93,13 @@ struct dt_device_node { bool static_evtchn_created; #endif + struct dt_property *properties; + struct dt_device_node *parent; + struct dt_device_node *child; + struct dt_device_node *sibling; + struct dt_device_node *next; /* TODO: Remove it. Only use to know the last children */ + struct dt_device_node *allnext; + /* * The main purpose of this list is to link the structure in the list * of devices assigned to domain. Results (defconfig): ARM64: before: 144B, 16B holes ARM64: after: 128B, no holes (-16B) ARM32: before: 72B, 4B holes ARM32: after: 68B, no holes (-4B) ~Michal
Hi, On 11/02/2025 15:14, Grygorii Strashko wrote: > > > On 11.02.25 14:32, Julien Grall wrote: >> >> >> On 11/02/2025 11:57, Orzel, Michal wrote: >>> On 11/02/2025 12:18, Grygorii Strashko wrote: >>>> >>>> >>>> The dt_device_for_passthrough() is called many times during Xen >>>> initialization and Dom0 creation. On every call it traverses struct >>>> dt_device_node properties list and compares compares properties name >>>> with >>> double "compares" > > 10x > >>> >>>> "xen,passthrough" which is runtime overhead. This can be optimized by >>> Are you sure? Looking at the calls, it's almost only used at boot >>> except for dt >>> overlay. >>> >>>> marking dt_device_node as passthrough while unflattening DT. >>>> >>>> This patch introduced new struct dt_device_node property >>>> "is_passthrough" >>>> which is filled if "xen,passthrough" property is present while >>>> unflattening >>>> DT and dt_device_for_passthrough() just return it's value. >>> In the past we were skeptical about adding new fields to the >>> dt_device_node >>> structure for use cases like this one. I would say this optimization >>> is not >>> worth it. Also, why would you optimize dt_device_for_passthrough but not >>> e.g. dt_device_is_available. >> >> So we are trading speed with memory usage. It looks like we may be >> using a padding, although I didn't check whether the existing >> structure could be packed... > > Actually I see it consumes nothing due to alignments: > - before sizeof(dt_device_node)=144 > - after sizeof(dt_device_node)=144 > > But it could be made bool is_passthrough:1; together with other bools, > and probably moved at the end of struct dt_device_node. > > By the way, below diff can save 8 bytes on arm64 per struct dt_device_node. > >> >>> >>> You can check with other Arm maintainers. >> >> Before forging an opinion, I'd like to see some numbers showing the >> performance improvement. Also, is this impacting only boot? > > Sry, indeed only boot, need to be more specific. > > My DT: ~700 nodes > Number of calls till the end of create_dom0(): > (XEN) =============== dt_device_for_passthrough=2684 > dt_device_is_available=1429. > > I've enabled console_timestamps=boot and did 5 boots and calculated > average - it gives ~20 microseconds improvement. This doesn't seem to be worth it. But I also don't know what's the normal boot time on your system... I guess we are still in seconds? >> Also, I agree with Michal, if this is a concern for >> dt_device_device_for_passthrough(). Then it would be a concern for a >> few others calls using dt_find_property(). Are you going to modify all >> of them? > > I follow the rule one patch one functional change. Regarding further > optimization - I think only generic properties can be optimized and > personally I see now only "xen,passthrough" and may be "status". > > Ok. 20 microseconds. it's probably more like a measurement error margin. > Please advice if i should continue or drop? See above. Regardless that, would you be able to provide a bit more information of your end goal? Are you intending to be able to boot Xen/dom0/dom0less guest in less than N milliseconds? How far are you from that target? Are those numbers done on the latest Xen? Asking because there are probably bigger place where optimization can be done first. Cheers,
On 12.02.25 11:04, Julien Grall wrote: > Hi, > > On 11/02/2025 15:14, Grygorii Strashko wrote: >> >> >> On 11.02.25 14:32, Julien Grall wrote: >>> >>> >>> On 11/02/2025 11:57, Orzel, Michal wrote: >>>> On 11/02/2025 12:18, Grygorii Strashko wrote: >>>>> >>>>> >>>>> The dt_device_for_passthrough() is called many times during Xen >>>>> initialization and Dom0 creation. On every call it traverses struct >>>>> dt_device_node properties list and compares compares properties name with >>>> double "compares" >> >> 10x >> >>>> >>>>> "xen,passthrough" which is runtime overhead. This can be optimized by >>>> Are you sure? Looking at the calls, it's almost only used at boot except for dt >>>> overlay. >>>> >>>>> marking dt_device_node as passthrough while unflattening DT. >>>>> >>>>> This patch introduced new struct dt_device_node property "is_passthrough" >>>>> which is filled if "xen,passthrough" property is present while unflattening >>>>> DT and dt_device_for_passthrough() just return it's value. >>>> In the past we were skeptical about adding new fields to the dt_device_node >>>> structure for use cases like this one. I would say this optimization is not >>>> worth it. Also, why would you optimize dt_device_for_passthrough but not >>>> e.g. dt_device_is_available. >>> >>> So we are trading speed with memory usage. It looks like we may be using a padding, although I didn't check whether the existing structure could be packed... >> >> Actually I see it consumes nothing due to alignments: >> - before sizeof(dt_device_node)=144 >> - after sizeof(dt_device_node)=144 >> >> But it could be made bool is_passthrough:1; together with other bools, and probably moved at the end of struct dt_device_node. >> >> By the way, below diff can save 8 bytes on arm64 per struct dt_device_node. >> >>> >>>> >>>> You can check with other Arm maintainers. >>> >>> Before forging an opinion, I'd like to see some numbers showing the performance improvement. Also, is this impacting only boot? >> >> Sry, indeed only boot, need to be more specific. >> >> My DT: ~700 nodes >> Number of calls till the end of create_dom0(): >> (XEN) =============== dt_device_for_passthrough=2684 dt_device_is_available=1429. >> >> I've enabled console_timestamps=boot and did 5 boots and calculated average - it gives ~20 microseconds improvement. > > This doesn't seem to be worth it. But I also don't know what's the normal boot time on your system... I guess we are still in seconds? Yes. in general if exclude SILO timeout. (XEN) [ 0.433789] *************************************************** (XEN) [ 0.434588] WARNING: SILO mode is not enabled. (XEN) [ 0.435204] It has implications on the security of the system, (XEN) [ 0.435992] unless the communications have been forbidden between (XEN) [ 0.436813] untrusted domains. (XEN) [ 0.437256] *************************************************** (XEN) [ 0.438055] 3... 2... 1... (XEN) [ 3.438566] *** Serial input to DOM0 (type 'CTRL-a' three times to switch input) (XEN) [ 3.439559] Freed 400kB init memory. > >>> Also, I agree with Michal, if this is a concern for dt_device_device_for_passthrough(). Then it would be a concern for a few others calls using dt_find_property(). Are you going to modify all of them? >> >> I follow the rule one patch one functional change. Regarding further optimization - I think only generic properties can be optimized and personally I see now only "xen,passthrough" and may be "status". >> >> Ok. 20 microseconds. it's probably more like a measurement error margin. >> Please advice if i should continue or drop? > > See above. Regardless that, would you be able to provide a bit more information of your end goal?Are you intending to be able to boot Xen/dom0/dom0less guest in less than N milliseconds? How far are you from that target? Are those numbers done on the latest Xen? It's more like result of code studying from my side as Xen newbie. I've considered it as kinda "obvious" change - calc some value once is better then do the same calculations many times. Ok, it's not obvious. I'll drop it. > > Asking because there are probably bigger place where optimization can be done first. Thanks, -grygorii
On 12/02/2025 10:16, Grygorii Strashko wrote: >>>> Also, I agree with Michal, if this is a concern for >>>> dt_device_device_for_passthrough(). Then it would be a concern for a >>>> few others calls using dt_find_property(). Are you going to modify >>>> all of them? >>> >>> I follow the rule one patch one functional change. Regarding further >>> optimization - I think only generic properties can be optimized and >>> personally I see now only "xen,passthrough" and may be "status". >>> >>> Ok. 20 microseconds. it's probably more like a measurement error margin. >>> Please advice if i should continue or drop? >> >> See above. Regardless that, would you be able to provide a bit more >> information of your end goal?Are you intending to be able to boot Xen/ >> dom0/dom0less guest in less than N milliseconds? > How far are you from that target? Are those numbers done on the latest Xen? > > It's more like result of code studying from my side as Xen newbie. > I've considered it as kinda "obvious" change - calc some value once is Thi i > better then do the same calculations many times. To clarify, there are a lot of call to the function. But you also have a DT with 700 nodes. So effectively, you will call it ~4 times per node. But really, the issue is not the number of calls. The issue is that the property lookup function is expensive because there are many string comparisons. > Ok, it's not obvious. I'll drop it. It is a trade-off between speed, memory usage and maintenability. With your approach, we would need to look up for all the properties we care in advance and cache the result. For bool property, the memory usage increase is not "bad". However, it would require more memory if the property value is a string, multiple cells... What I would find more interesting is whether we can optimize the function doing the property lookup. This would benefits everyone rather than just a select few and depending on the result (speed, memory usage) I would definitely consider to include in Xen. Cheers,
diff --git a/xen/common/device-tree/device-tree.c b/xen/common/device-tree/device-tree.c index d0528c582565..a329aaf576da 100644 --- a/xen/common/device-tree/device-tree.c +++ b/xen/common/device-tree/device-tree.c @@ -1682,8 +1682,7 @@ bool dt_device_is_available(const struct dt_device_node *device) bool dt_device_for_passthrough(const struct dt_device_node *device) { - return (dt_find_property(device, "xen,passthrough", NULL) != NULL); - + return device->is_passthrough; } static int __dt_parse_phandle_with_args(const struct dt_device_node *np, @@ -1913,6 +1912,7 @@ static unsigned long unflatten_dt_node(const void *fdt, np->used_by = 0; /* By default the device is not protected */ np->is_protected = false; + np->is_passthrough = false; INIT_LIST_HEAD(&np->domain_list); if ( new_format ) @@ -2001,6 +2001,9 @@ static unsigned long unflatten_dt_node(const void *fdt, * stuff */ if ( strcmp(pname, "ibm,phandle") == 0 ) np->phandle = be32_to_cpup((__be32 *)*p); + + if ( strcmp(pname, "xen,passthrough") == 0 ) + np->is_passthrough = true; pp->name = pname; pp->length = sz; pp->value = (void *)*p; diff --git a/xen/include/xen/device_tree.h b/xen/include/xen/device_tree.h index 5ff763bb80bb..96001d5b7843 100644 --- a/xen/include/xen/device_tree.h +++ b/xen/include/xen/device_tree.h @@ -94,6 +94,8 @@ struct dt_device_node { /* IOMMU specific fields */ bool is_protected; + /* Indicates DT device is for passthrough */ + bool is_passthrough; #ifdef CONFIG_STATIC_EVTCHN /* HACK: Remove this if there is a need of space */
The dt_device_for_passthrough() is called many times during Xen initialization and Dom0 creation. On every call it traverses struct dt_device_node properties list and compares compares properties name with "xen,passthrough" which is runtime overhead. This can be optimized by marking dt_device_node as passthrough while unflattening DT. This patch introduced new struct dt_device_node property "is_passthrough" which is filled if "xen,passthrough" property is present while unflattening DT and dt_device_for_passthrough() just return it's value. Signed-off-by: Grygorii Strashko <grygorii_strashko@epam.com> --- xen/common/device-tree/device-tree.c | 7 +++++-- xen/include/xen/device_tree.h | 2 ++ 2 files changed, 7 insertions(+), 2 deletions(-)