Message ID | cover.1738686764.git.maciej.wieczor-retman@intel.com (mailing list archive) |
---|---|
Series | kasan: x86: arm64: risc-v: KASAN tag-based mode for x86 |
ARM64 supports MTE which is hardware support for tagging 16 byte granules and verification of tags in pointers all in hardware and on some platforms with *no* performance penalty since the tag is stored in the ECC areas of DRAM and verified at the same time as the ECC.

Could we get support for that? This would allow us to enable tag checking in production systems without performance penalty and no memory overhead.
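For readers unfamiliar with the scheme described above, here is a minimal software model of a granule tag check. It is only a sketch under stated assumptions: the 16-byte granule comes from the message above, the 4-bit tag in pointer bits 59:56 follows the Arm MTE definition, and every `model_*` name and the 256-byte toy memory are hypothetical. It does not show how the hardware, the kernel, or KASAN actually implement the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_SIZE 16u     /* memory is tagged in 16-byte granules */
#define TAG_SHIFT    56      /* assumed: logical tag in pointer bits 59:56 */
#define TAG_MASK     0xfull  /* 4-bit tag */
#define MODEL_BYTES  256u    /* hypothetical size of the toy memory */

/* One tag per granule; static storage starts out as tag 0. */
static uint8_t model_tags[MODEL_BYTES / GRANULE_SIZE];

/* Place a 4-bit tag into the upper bits of an address. */
static uint64_t model_set_ptr_tag(uint64_t addr, uint8_t tag)
{
    addr &= ~(TAG_MASK << TAG_SHIFT);
    return addr | (((uint64_t)tag & TAG_MASK) << TAG_SHIFT);
}

/* Extract the tag carried by a pointer. */
static uint8_t model_ptr_tag(uint64_t ptr)
{
    return (uint8_t)((ptr >> TAG_SHIFT) & TAG_MASK);
}

/* Strip the tag to recover the plain address. */
static uint64_t model_ptr_addr(uint64_t ptr)
{
    return ptr & ~(TAG_MASK << TAG_SHIFT);
}

/* Assign one tag to every granule overlapping [addr, addr + len). */
static void model_tag_region(uint64_t addr, size_t len, uint8_t tag)
{
    assert(len > 0 && addr + len <= MODEL_BYTES);
    for (uint64_t g = addr / GRANULE_SIZE;
         g <= (addr + len - 1) / GRANULE_SIZE; g++)
        model_tags[g] = (uint8_t)(tag & TAG_MASK);
}

/* An access passes only if the pointer tag equals the granule tag. */
static bool model_access_ok(uint64_t ptr)
{
    uint64_t addr = model_ptr_addr(ptr);

    if (addr >= MODEL_BYTES)
        return false;
    return model_tags[addr / GRANULE_SIZE] == model_ptr_tag(ptr);
}
```

In this model, tagging the 32 bytes at offset 32 with tag 5 lets a pointer carrying tag 5 reach offsets 32 through 63, while the same pointer advanced by 32 bytes lands in an untagged granule and fails the check. Wherever the tags are stored, 4 bits per 16-byte granule amounts to 1/32 of the tagged memory.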
On 2/4/25 10:58, Christoph Lameter (Ampere) wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
>
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

At least on the Intel side, there's no trajectory for doing something like the MTE architecture for memory tagging. The DRAM "ECC" area is in very high demand and if anything things are moving away from using ECC "bits" for anything other than actual ECC. Even the MKTME+integrity (used for TDX) metadata is probably going to find a new home at some point.

This shouldn't be a surprise to anyone on cc here. If it is, you should probably be reaching out to Intel over your normal channels.
On 4 Feb 2025, at 18:58, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> ARM64 supports MTE which is hardware support for tagging 16 byte granules
> and verification of tags in pointers all in hardware and on some platforms
> with *no* performance penalty since the tag is stored in the ECC areas of
> DRAM and verified at the same time as the ECC.
>
> Could we get support for that? This would allow us to enable tag checking
> in production systems without performance penalty and no memory overhead.

It’s not “no performance penalty”, there is a cost to tracking the MTE tags for checking. In asynchronous (or asymmetric) mode that’s not too bad, but in synchronous mode there is a significant overhead even with ECC.

Normally on a store, once you’ve translated it and have the data, you can buffer it up and defer the actual write until some time later. If you hit in the L1 cache then that will probably be quite soon, but if you miss then you have to wait for the data to come back from lower levels of the hierarchy, potentially all the way out to DRAM. Or if you have a write-around cache then you just send it out to the next level when it’s ready.

But now, if you have synchronous MTE, you cannot retire your store instruction until you know what the tag for the location you’re storing to is; effectively you have to wait until you can do the full cache lookup, and potentially miss, until it can retire. This puts pressure on the various microarchitectural structures that track instructions as they get executed, as instructions are now in flight for longer.

Yes, it may well be that it is quicker for the memory controller to get the tags from ECC bits than via some other means, but you’re already paying many many cycles at that point, with the relevant store being stuck unable to retire (and thus every instruction after it in the instruction stream) that whole time, and no write allocate or write around schemes can help you, because you fundamentally have to wait for the tags to be read before you know if the instruction is going to trap.

Now, you can choose to not use synchronous mode due to that overhead, but that’s nuance that isn’t considered by your reply here and has some consequences.

Jess
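The retirement argument above can be sketched as a toy model. This is an illustration only, under loud assumptions: all `toy_*` names are hypothetical, the tag-fetch latency is a caller-supplied number rather than a measurement of any real core, and asymmetric mode is left out. The point it tries to capture is that a synchronous check ties store retirement to the tag lookup and reports a mismatch precisely, while an asynchronous check lets the store retire from the buffer and only records the mismatch for later reporting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_mode { TOY_ASYNC, TOY_SYNC };

struct toy_store {
    uint8_t  ptr_tag;           /* tag carried in the store's pointer */
    uint8_t  mem_tag;           /* tag held for the target granule */
    unsigned tag_fetch_cycles;  /* hypothetical latency to learn mem_tag */
};

struct toy_result {
    unsigned retire_delay;   /* cycles the store waits on the tag check */
    bool     precise_fault;  /* sync: the store itself traps */
    bool     deferred_flag;  /* async: mismatch noted, reported later */
};

static struct toy_result toy_retire(enum toy_mode mode, struct toy_store s)
{
    struct toy_result r = { 0, false, false };
    bool mismatch = (s.ptr_tag & 0xf) != (s.mem_tag & 0xf);

    assert(mode == TOY_SYNC || mode == TOY_ASYNC);

    if (mode == TOY_SYNC) {
        /* The store cannot retire until the tag is known, so the
         * full lookup latency (cache hit or miss) is on its path. */
        r.retire_delay = s.tag_fetch_cycles;
        r.precise_fault = mismatch;
    } else {
        /* The store retires once buffered; the check completes in
         * the background and only sets a sticky flag on mismatch. */
        r.deferred_flag = mismatch;
    }
    return r;
}
```

With a mismatching store whose tag lookup takes, say, 200 cycles in this model, `toy_retire(TOY_SYNC, ...)` reports a delay of 200 and a precise fault, whereas `toy_retire(TOY_ASYNC, ...)` reports a delay of 0 and only the deferred flag. The model deliberately ignores the knock-on effect on later instructions, which is the larger cost described above.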
On Tue, 4 Feb 2025, Jessica Clarke wrote:
> It’s not “no performance penalty”, there is a cost to tracking the MTE
> tags for checking. In asynchronous (or asymmetric) mode that’s not too

On Ampere Processor hardware there is no penalty since the logic is built into the usual read/write paths. This is by design. There may be a penalty on other platforms that cannot do this.
On Tue, 4 Feb 2025, Dave Hansen wrote:
> > Could we get support for that? This would allow us to enable tag checking
> > in production systems without performance penalty and no memory overhead.
>
> At least on the Intel side, there's no trajectory for doing something
> like the MTE architecture for memory tagging. The DRAM "ECC" area is in
> very high demand and if anything things are moving away from using ECC
> "bits" for anything other than actual ECC. Even the MKTME+integrity
> (used for TDX) metadata is probably going to find a new home at some point.
>
> This shouldn't be a surprise to anyone on cc here. If it is, you should
> probably be reaching out to Intel over your normal channels.

Intel was a competitor for our company and AFAICT has issues all over the place with performance given its conservative stands on technology. But we do not test against Intel anymore.

Can someone from AMD say something?

MTE tagging is part of the processor standard for ARM64 and Linux will need to support the 16 byte tagging feature one way or another even if Intel does not like it. And AFAICT hardware tagging support is a critical security feature for the future.