[RFC,0/2] ACPI / PPTT: ids for caches

Message ID 20181005150235.13846-1-james.morse@arm.com (mailing list archive)

Message

James Morse Oct. 5, 2018, 3:02 p.m. UTC
Hi guys,

To get resctrl working on arm64, we need to generate 'id's for caches.
This is the value that shows up in, e.g.:
| /sys/devices/system/cpu/cpu0/cache/index3/id

This value needs to be unique within each cache level, but doesn't
need to be contiguous (there may be gaps, and it may not start at 0).
Details in Documentation/x86/intel_rdt_ui.txt::Cache IDs

resctrl receives these values back via its schemata file. e.g.:
| echo "L3:0=fff;1=fff" > /sys/fs/resctrl/p1/schemata
Where 0 and 1 are the ids of two caches in the system.

These values become ABI, and are likely to be baked into shell scripts.
We want a value that is the same over reboots, and should be the same
on identical hardware, even if the PPTT is generated in a different
order. The hardware doesn't give us any indication of which caches are
shared, so this information must come from firmware tables.

This series generates an id from the PPTT topology, based on the lowest
MPIDR of the cpus that share a cache.

The remaining problems with this approach are:
 * the 32bit ID field is full of MPIDR.Aff{0-3}. We don't have space to
   hide 'i/d/unified', so can only generate ids for unified caches. If we
   ever get an Aff4 (plenty of RES0 space in there) we can no longer generate
   an id. Having all these bits accounted for in the initial version doesn't
   feel like a good ABI choice.

* Existing software is going to assume caches are numbered 0,1,2. This was
  documented as not guaranteed, and it's likely never going to be the case
  if we generate ids like this.

* The table walk is recursive.
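As a concrete illustration of the first bullet, here is a minimal C sketch
(the helper name is invented, not code from the series): MPIDR_EL1 keeps
Aff0-2 in bits [23:0] and Aff3 in bits [39:32], so packing all four affinity
fields into a u32 id consumes every bit, leaving no room for an i/d/unified
marker or a future Aff4.

```c
#include <stdint.h>

/*
 * Hypothetical sketch, not the series' code: pack the four MPIDR
 * affinity fields into a 32-bit cache id. Aff0..Aff2 already occupy
 * bits [23:0]; Aff3 sits in bits [39:32] and is shifted down into
 * bits [31:24]. All 32 bits of the id are then spoken for.
 */
static uint32_t mpidr_to_cache_id(uint64_t mpidr)
{
	uint32_t aff0_2 = (uint32_t)(mpidr & 0xffffff);     /* Aff0..Aff2 */
	uint32_t aff3   = (uint32_t)((mpidr >> 32) & 0xff); /* Aff3 */

	return (aff3 << 24) | aff0_2;
}
```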


Fixes for the first two require extra code to compact the ID range, which would
require us to generate all the IDs up front, not from hotplug callbacks as
happens today.

Alternatively, we could try to change the ABI to provide a u64 as the
cache id. The size isn't documented, and for resctrl, userspace can treat
it as a string.

Better ideas welcome!

Thanks,

James Morse (2):
  ACPI / processor: Add helper to convert acpi_id to a phys_cpuid
  ACPI / PPTT: cacheinfo: Label caches based on fw_token

 arch/arm64/include/asm/acpi.h |  6 +++
 drivers/acpi/pptt.c           | 81 +++++++++++++++++++++++++++++++++++
 drivers/acpi/processor_core.c | 16 +++++++
 include/acpi/processor.h      |  1 +
 4 files changed, 104 insertions(+)

Comments

Jeffrey Hugo Oct. 5, 2018, 3:20 p.m. UTC | #1
On 10/5/2018 9:02 AM, James Morse wrote:
> [...]

I'm sorry, I'm not familiar with resctrl, and therefore I don't quite
feel like I have a handle on what we need out of the id file (and the
Documentation you pointed to doesn't seem to clarify it for me).

Let's assume we have a trivial 4 core system.  Each core has a private
L1i and L1d cache.  Cores 0/1 and 2/3 share an L2.  Cores 0-3 share an L3.

If we are assigning ids in the range 1-N, what might we expect the id of 
each cache to be?

Is this sane (each unique cache instance has a unique id), or have I 
misunderstood?
CPU0 L1i - 1
CPU0 L1d - 2
CPU1 L1i - 3
CPU1 L1d - 4
CPU2 L1i - 5
CPU2 L1d - 6
CPU3 L1i - 7
CPU3 L1d - 8
CPU0/1 L2 - 9
CPU2/3 L2 - 10
        L3 - 11
James Morse Oct. 5, 2018, 3:54 p.m. UTC | #2
Hi Jeffrey,

On 05/10/18 16:20, Jeffrey Hugo wrote:
> On 10/5/2018 9:02 AM, James Morse wrote:
>> [...]
> 
> I'm sorry, I'm not familiar with this resctrl, and therefore I don't quite feel
> like I have a handle on what we need out of the ids file (and the Documentation
> you pointed to doesn't seem to clarify it for me).

> Let's assume we have a trivial 4 core system.  Each core has a private L1i and
> L1d cache.  Cores 0/1 and 2/3 share an L2.  Cores 0-3 share an L3.

The i/d caches wouldn't get an ID, because we can't easily generate unique
values for these (with this scheme, all the id bits are in use for shared
unified caches).

Cores 0 and 1 should show the same ID for their L2; cores 2 and 3 should show
a different ID. Cores 0-3 should all show the same id for the L3.


> If we are assigning ids in the range 1-N, what might we expect the id of each
> cache to be?
> 
> Is this sane (each unique cache instance has a unique id), or have I misunderstood?
> CPU0 L1i - 1
> CPU0 L1d - 2
> CPU1 L1i - 3
> CPU1 L1d - 4
> CPU2 L1i - 5
> CPU2 L1d - 6
> CPU3 L1i - 7
> CPU3 L1d - 8

> CPU0/1 L2 - 9
> CPU2/3 L2 - 10

>        L3 - 11

This would be sane. We don't need to continue the numbering between L1/L2/L3.
The id only needs to be unique at that level.


The problem is generating these numbers if only some of the CPUs are online, or
if the ACPI tables are generated by firmware at power-on and have a different
layout every time.
We don't even want to rely on Linux's CPU numbering.

The suggestion here is to use the smallest MPIDR, as that's a hardware property
that won't change even if the tables are generated differently every boot.

Assuming two clusters in your example above, it would look like:

| CPU0/1 (cluster 0) L2 - 0x0
| CPU2/3 (cluster 1) L2 - 0x100
|                    L3 - 0x0
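The rule in the example above can be sketched as follows; this is a hedged
illustration (the function name and flat arrays are invented here, and the
series derives the sharing sets by walking the PPTT), assuming cluster 0's
CPUs have MPIDRs 0x0/0x1 and cluster 1's have 0x100/0x101:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only -- the name and the flat array of sharers are
 * invented for this example. The id of a cache is the smallest MPIDR
 * among the CPUs that share it.
 */
static uint64_t cache_id_from_sharers(const uint64_t *mpidrs, size_t nr)
{
	uint64_t min = mpidrs[0];

	for (size_t i = 1; i < nr; i++)
		if (mpidrs[i] < min)
			min = mpidrs[i];
	return min;
}
```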



Thanks,

James
Jeffrey Hugo Oct. 5, 2018, 4:39 p.m. UTC | #3
On 10/5/2018 9:54 AM, James Morse wrote:
> Hi Jeffrey,
> 
> On 05/10/18 16:20, Jeffrey Hugo wrote:
>> On 10/5/2018 9:02 AM, James Morse wrote:
>>> [...]
>>
>> I'm sorry, I'm not familiar with this resctrl, and therefore I don't quite feel
>> like I have a handle on what we need out of the ids file (and the Documentation
>> you pointed to doesn't seem to clarify it for me).
> 
>> Let's assume we have a trivial 4 core system.  Each core has a private L1i and
>> L1d cache.  Cores 0/1 and 2/3 share an L2.  Cores 0-3 share an L3.
> 
> The i/d caches wouldn't get an ID, because we can't easily generate unique
> values for these. (with this scheme, all the id bits are in use for shared
> unified caches).
> 
> Cores 0 and 1 should show the same ID for their L2, 2 and 3 should show a
> different ID. Cores 0-3 should all show the same id for L3.
> 
> 
>> If we are assigning ids in the range 1-N, what might we expect the id of each
>> cache to be?
>>
>> Is this sane (each unique cache instance has a unique id), or have I misunderstood?
>> CPU0 L1i - 1
>> CPU0 L1d - 2
>> CPU1 L1i - 3
>> CPU1 L1d - 4
>> CPU2 L1i - 5
>> CPU2 L1d - 6
>> CPU3 L1i - 7
>> CPU3 L1d - 8
> 
>> CPU0/1 L2 - 9
>> CPU2/3 L2 - 10
> 
>>         L3 - 11
> 
> This would be sane. We don't need to continue the numbering between L1/L2/L3.
> The id only needs to be unique at that level.
> 
> 
> The problem is generating these numbers if only some of the CPUs are online, or
> if the ACPI tables are generated by firmware at power-on and have a different
> layout every time.
> We don't even want to rely on Linux's CPU numbering.
> 
> The suggestion here is to use the smallest MPIDR, as that's a hardware property
> that won't change even if the tables are generated differently every boot.

I can't think of a reason why affinity level 0 would ever change for a
particular thread or core (SMT vs non-SMT). However, none of the other
affinity levels have a well-defined meaning (they are implementation
dependent), and could very well change boot to boot.

I would strongly avoid using MPIDR, particularly for the use case you've
described.

> 
> Assuming two clusters in your example above, it would look like:
> 
> | CPU0/1 (cluster 0) L2 - 0x0
> | CPU2/3 (cluster 1) L2 - 0x100
> |                    L3 - 0x0

Thanks for the clarification.  I think I've got enough to wrap my head 
around this.  Let me think on it a bit to see if I can come up with a 
suggestion (we can debate how good it is).
James Morse Oct. 8, 2018, 9:26 a.m. UTC | #4
Hi Jeffrey,

On 05/10/2018 17:39, Jeffrey Hugo wrote:
> On 10/5/2018 9:54 AM, James Morse wrote:
>> The problem is generating these numbers if only some of the CPUs are online, or
>> if the ACPI tables are generated by firmware at power-on and have a different
>> layout every time.
>> We don't even want to rely on Linux's CPU numbering.
>>
>> The suggestion here is to use the smallest MPIDR, as that's a hardware property
>> that won't change even if the tables are generated differently every boot.
> 
> I can't think of a reason why affinity level 0 would ever change for a

Affinity level of what? The caches? Sure, that should be impossible to change.


> particular thread or core (SMT vs non-SMT), however none of the other affinity
> levels have a well defined meaning (implementation dependent), and could very
> well change boot to boot.

Ah, you mean the level numbers. Yes, these are (quasi) discovered from the
table, so can't be relied on.

If you insert a new level then this would shuffle the user-visible indexes and
levels. I would argue this is no longer the same hardware.

Doing this may already break resctrl as the 'L2' and 'L3' numbers are part of
the ABI.

The ids generated would still be unique for a level.


> I would strongly avoid using MPIDR, particularly for the usecase you've described.

Is there an alternative? It needs to be a hardware property to insulate us from
the tables being re-generated.

I agree the MPIDR numbers are likely to be ugly (I don't really care...).
The u32 id being full from day 1 is more of a problem.


>> Assuming two clusters in your example above, it would look like:
>>
>> | CPU0/1 (cluster 0) L2 - 0x0
>> | CPU2/3 (cluster 1) L2 - 0x100
>> |                    L3 - 0x0
> 
> Thanks for the clarification.  I think I've got enough to wrap my head around
> this.  Let me think on it a bit to see if I can come up with a suggestion (we
> can debate how good it is).

Sounds good.


Thanks!

James
Jeffrey Hugo Oct. 10, 2018, 4:19 p.m. UTC | #5
On 10/8/2018 3:26 AM, James Morse wrote:
> Hi Jeffrey,
> 
> On 05/10/2018 17:39, Jeffrey Hugo wrote:
>> On 10/5/2018 9:54 AM, James Morse wrote:
>>> The problem is generating these numbers if only some of the CPUs are online, or
>>> if the ACPI tables are generated by firmware at power-on and have a different
>>> layout every time.
>>> We don't even want to rely on Linux's CPU numbering.
>>>
>>> The suggestion here is to use the smallest MPIDR, as that's a hardware property
>>> that won't change even if the tables are generated differently every boot.
>>
>> I can't think of a reason why affinity level 0 would ever change for a
> 
> affinity level of what? The caches? Sure, that should be impossible to change.
> 
> 
>> particular thread or core (SMT vs non-SMT), however none of the other affinity
>> levels have a well defined meaning (implementation dependent), and could very
>> well change boot to boot.
> 
> Ah, you mean the level numbers. Yes, these are (quasi) discovered from the
> table, so can't be relied on.
> 
> If you insert a new level then this would shuffle the user-visible indexes and
> levels. I would argue this is no longer the same hardware.

Well, it's common in my experience for x86 server hardware to allow
the user to enable/disable SMT.  This lets the user configure the
hardware to better suit their workloads.

Applying this to ARM, you might have a system that starts in non-SMT
mode:

Aff3: 0
Aff2: 0
Aff1: 0
Aff0: X

however, after changing the setting to enable SMT and a reboot:

Aff3: 0
Aff2: 0
Aff1: Y
Aff0: X

If you base the cache ids on MPIDR, then they would likely change between
SMT and non-SMT, for example.  However, you seem to consider that
different hardware, and therefore if the cache ids change, so be it.

> 
> Doing this may already break resctrl as the 'L2' and 'L3' numbers are part of
> the ABI.
> 
> The ids generated would still be unique for a level.
> 
> 
>> I would strongly avoid using MPIDR, particularly for the usecase you've described.
> 
> Is there an alternative? It needs to be a hardware property to insulate us from
> the tables being re-generated.
> 
> I agree the MPIDR numbers are likely to be ugly, (I don't really care...).
> The u32 id being full from day 1 is more of a problem.

I agree, it needs to be a hardware property, and for the purposes of
uniquely identifying a core, I'm not seeing an alternative to MPIDR.  I
think that MPIDR should be OK so long as we don't attempt to derive any
additional information from it, other than that it's an opaque value
which uniquely identifies a core (or thread).

Since you have a u32 cache id, and MPIDR is a u64, I'm not really
thrilled with trying to jam MPIDR into the cache id.  Also, as you point
out, even though it's not guaranteed, SW is going to expect the cache
ids to be numbered 0, 1, 2, etc.  I've seen issues with virtio where
MPIDR values were exposed via sysfs as unique CPU identifiers, and it
ended up preventing VMs from initializing on the system.

I suspect identifying the list of MPIDRs in the system, sorting them, 
and then assigning cache ids based on the sorted list position would be 
the best solution, ignoring that implementing it in PPTT parsing is 
probably going to be painful.
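
That idea could be sketched roughly like this (hedged; all names are
invented for illustration, and real code would hook into PPTT parsing
and cope with hotplug):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for u64 values */
static int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sketch of the suggestion above: sort every MPIDR in the system, then
 * use a CPU's position in the sorted list as its compact identifier. A
 * cache id would then be the position of the smallest MPIDR among the
 * cache's sharers, giving small 0, 1, 2, ... style numbers instead of
 * raw affinity values.
 */
static int compact_id(uint64_t mpidr, uint64_t *all_mpidrs, size_t nr)
{
	qsort(all_mpidrs, nr, sizeof(*all_mpidrs), cmp_u64);
	for (size_t i = 0; i < nr; i++)
		if (all_mpidrs[i] == mpidr)
			return (int)i;
	return -1;	/* not found */
}
```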
